Abstract


Distributed systems experience and should tolerate faults beyond simple component crashes as such
systems grow in size and importance. Unfortunately, tolerating arbitrary faults, also known as
Byzantine faults, poses several challenges to system designers, often limiting performance, requiring
additional hardware, or both. This dissertation presents new protocols that provide substantially
better performance than previously demonstrated. The Byzantine fault-tolerant erasure-coded block
storage protocol proposed in this thesis provides 40% higher write throughput than the best prior
approach. The Byzantine fault-tolerant replicated state machine provides a factor of 2.2-2.9 times
higher throughput than the best prior approach. Furthermore, the protocols presented in this
dissertation require 25-33% fewer responsive servers than the nearest competitors. To enable these
results, this dissertation introduces several new techniques, including homomorphic fingerprinting,
partial encoding, and Byzantine Locking, that provide unprecedented scalability, higher throughput,
lower latency, and lower computational overhead. This dissertation also considers new methods for
analyzing the correctness of distributed systems in the presence of faulty clients. Distributed
services and storage systems built using these techniques can provide Byzantine fault tolerance in a
more efficient, higher performance, and more scalable manner than previously thought possible.




Acknowledgments 


I’d like to thank the members and alumni of the PDL and CyLab for their feedback, patience,
and friendship, including Michael Abd-El-Malek, James Hoe, Elie Krevat, Michael Mesnier,
Bryan Parno, Adrian Perrig, Brandon Salmon, Raja Sambasivan, Shafeeq Sinnamohideen, Matthew
Wachs, Theodore Wong, and Jay Wylie. I’d like to acknowledge Greg for providing support
throughout my tenure at Carnegie Mellon, Mike for his feedback on my proofs and arguments,
and Miguel and Priya for providing feedback, insight, and support throughout my dissertation. I’d
also like to thank the members and companies of the CyLab Corporate Partners and the PDL
Consortium (including APC, Cisco, DataDomain, EMC, Facebook, Google, Hewlett-Packard, Hitachi,
IBM, Intel, LSI, Microsoft, NEC, NetApp, Oracle, Panasas, Seagate, Sun, Symantec, and VMware)
for their interest, insights, feedback, and support.




Contents 


1 Introduction
1.1 Thesis Statement
1.2 Verifying Distributed Erasure-Coded Data
1.3 Correctness in the Presence of Faulty Clients
1.4 Low-Overhead Byzantine Fault-Tolerant Storage
1.5 Scalable Fault Tolerance through Byzantine Locking

2 Homomorphic Fingerprinting: Verifying Distributed Erasure-Coded Data
2.1 Homomorphic Fingerprinting
2.1.1 Fingerprinting
2.1.2 Homomorphism
2.1.3 Applications to Erasure Codes
2.2 Fingerprinted Cross-checksum
2.3 Example: Improving AVID
2.3.1 AVID
2.3.2 AVID-FP
2.3.3 AVID-FP Pseudo-code
2.3.4 AVID-FP Correctness
2.4 Performance
2.5 Other Protocols
2.6 Related Work
2.7 Conclusion

3 The Correctness of Distributed Systems in the Presence of Faulty Clients
3.1 Background
3.2 Safety in the Presence of Faulty Clients
3.3 Stricter Extensions of Linearizability
3.3.1 Invocation Criteria
3.3.2 Recovery
3.4 A Wait-free Storage Protocol with Immediate Recovery
3.4.1 Write
3.4.2 Read
3.4.3 Revoke




3.4.4 Linearizability and Immediate Recovery
3.5 Conclusion

4 Low-Overhead Byzantine Fault-Tolerant Storage
4.1 Background
4.1.1 Beyond Crash Faults
4.1.2 The Cost of Byzantine Fault Tolerance
4.1.3 Byzantine Fault-Tolerant Storage
4.2 The FP Protocol
4.2.1 Design
4.2.2 System Model
4.2.3 Detailed Pseudo-code
4.2.4 Correctness
4.3 Implementation
4.4 Evaluation
4.4.1 Competing Protocols
4.4.2 Experimental Setup
4.4.3 Write Throughput
4.4.4 Read Throughput
4.4.5 Response Time
4.5 Conclusion

5 Scalable Fault Tolerance through Byzantine Locking
5.1 Context and Related Work
5.1.1 The Byzantine Efficiency Race
5.1.2 How Zzyzx Fits In
5.1.3 Prior Byzantine Fault-Tolerant Replicated State Machine Protocols
5.1.4 Additional Related Work
5.2 Definitions and System Model
5.3 Byzantine Locking and Zzyzx
5.3.1 The Zyzzyva Interface and Locking
5.3.2 The Log Interface
5.3.3 Handling Contention
5.4 Protocol Details
5.4.1 Checkpointing and State Transfer
5.4.2 Optimizations
5.4.3 Scalability Through Log Server Groups
5.5 Evaluation
5.5.1 Assumptions and Limitations
5.5.2 Experimental Setup
5.5.3 Scalability
5.5.4 Throughput
5.5.5 Latency
5.5.6 Performance with f Slow Servers




5.5.7 Performance Under Contention
5.5.8 Postmark and Trace-driven Execution
5.6 Conclusion

6 Conclusion

A The Correctness of Byzantine Locking
A.1 Sequential Specifications of Relevant Objects
A.1.1 The Object-Based State Machine Object
A.1.2 The Log Object
A.1.3 The Manager Object
A.2 Linearizability
A.2.1 The Reads-from Relation
A.2.2 The Equivalence of Replayed Requests
A.2.3 Requests that are not Replayed
A.2.4 Real-time precedence: Reads-from Strict
A.2.5 The Linearizability of Exec and Append
A.3 The Client and The Primary
A.3.1 Reads-from Valid and Reads-from Strict
A.3.2 The Primary
A.3.3 The Client
A.3.4 Liveness
A.3.5 Obstruction-Free Variants

B Zzyzx Optimizations
B.1 Faulty Client Isolation
B.2 Separating Unlock from Fetch
B.3 Append, Unlock, and Fetch
B.4 Import
B.5 Retrying if Append Returns failure
B.6 State Transfer and Next_Vs
B.6.1 missed_reqs(...)
B.6.2 Next_Vs and get_view(...)
B.6.3 Replaying Requests at the Log Object
B.6.4 Avoiding Replay
B.6.5 Garbage Collection
B.7 Using MACs Instead of Signatures in Append



List  of  Figures 


2.1.1 Encoding a vector
2.3.2 AVID-FP pseudo-code
4.2.1 Pseudo-code outline
4.2.2 Write pseudo-code
4.2.3 Read pseudo-code
4.4.4 Write throughput as a function of faults tolerated
4.4.5 Read throughput as a function of faults tolerated
4.4.6 Write response time as a function of faults tolerated
4.4.7 Write response time breakdown for f = 10
4.4.8 Read response time as a function of faults tolerated
4.4.9 Read response time breakdown for f = 10
5.0.1 Scalability
5.1.2 Comparison of protocols
5.3.3 Zzyzx components
5.3.4 Basic communication pattern of Zzyzx versus Zyzzyva
5.3.5 Unlocking
5.5.6 Throughput vs. clients
5.5.7 Latency vs. Throughput
5.5.8 Throughput under faults
5.5.9 Throughput vs. contention
5.5.10 File system evaluation
A.3.1 Primary pseudo-code
A.3.2 Client pseudo-code
B.3.1 Log server pseudo-code




Chapter  1 

Introduction 


As distributed systems grow in size and importance, they experience and should tolerate faults
beyond simple component crashes. Distributed systems deployed in real environments experience a
variety of faults, such as network misbehavior, storage failures, and software faults. Network
problems include message timeouts due to temporary overload, network partitions, or packet corruption.
Faulty physical storage may corrupt data or fail to make writes durable, such that a future read
returns stale data. Finally, race conditions in software, drivers, or firmware may result in transient
invalid results.

Though monolithic servers and simple redundancy are adequate for many applications, the
largest and most critical applications must survive in ever harsher environments. Less synchronous
networking delivers packets unreliably and unpredictably, and more faulty hardware and software
lose data, corrupt data, and provide stale data with greater frequency. In response to this
deteriorating situation, distributed protocols have experienced a natural progression over the years, from
monolithic servers to replicated services, from requiring synchronous environments to allowing
asynchrony, and from tolerating crashes to tolerating some corruptions through ad-hoc consistency
checks. Ad-hoc consistency checks, however, may not capture important failure modes, providing
the impetus for a more formalized fault model.

Protocols that can tolerate arbitrarily faulty behavior by components of the system are said to
be Byzantine fault-tolerant. Ideally, systems should tolerate Byzantine faulty clients or servers.
Byzantine fault tolerance ensures that all bases are covered, protecting against misdirected writes,
soft errors, and other faults and corruptions found in modern hardware and software. Though
Byzantine fault tolerance can protect against obscure or unlikely faults, such as a malicious insider, it is
important to remember that Byzantine fault tolerance also protects against more mundane yet still
perplexing faults that ad-hoc consistency checks may miss.

Unfortunately, Byzantine fault-tolerance is generally believed to be too expensive to justify in
practice. This dissertation presents Byzantine fault-tolerant protocols for building storage systems
and distributed services that provide much better performance and scalability than prior approaches.
It develops new techniques to reduce the cost and improve the scalability and performance of
Byzantine fault-tolerant distributed systems. The techniques described in this dissertation can be used to
build Byzantine fault-tolerant storage systems and services that are more efficient, higher
performance, and more scalable than previously thought possible.

1.1  Thesis  Statement 

Thesis  Statement:  Scalable  cluster-based  storage  systems  can  tolerate  Byzantine  faults 
with  substantially  lower  overhead  than  previously  demonstrated.  In  particular,  an  erasure-coded 
Byzantine  fault-tolerant  block  storage  system  can  provide  bandwidth  on  par  with  protocols  that 
tolerate  only  crashes,  and  a  Byzantine  fault-tolerant  metadata  service  can  provide  scalability  and 
respond  to  most  requests  in  a  single  round  trip,  even  when  only  the  minimal  number  of  servers  are 
responsive. 

To support this thesis statement, this dissertation takes the following steps. First, it develops a new
cryptographic primitive, which shows that erasure-coded data can be efficiently verified in a
distributed system. Second, it develops a new Byzantine fault-tolerant storage protocol and proves
its correctness, which shows that the performance of a Byzantine fault-tolerant block storage
protocol can be competitive in theory. Third, this dissertation describes prototype implementations
of the storage protocol and competing protocols, including protocols that tolerate only crashes. It
then measures and compares the performance of the prototypes experimentally, which shows
that the performance of a Byzantine fault-tolerant protocol can be competitive in practice with
protocols that tolerate only crashes. Fourth, this dissertation develops a new Byzantine fault-tolerant
replicated state machine protocol, proves its correctness, describes a prototype implementation, and
measures the prototype against competing prototypes to demonstrate its performance and scalability
properties.

Contributions: This dissertation makes four primary contributions. First, it introduces a new
cryptographic primitive, homomorphic fingerprinting, which can be used to verify that distributed
erasure-coded data was properly encoded. This technique allows many replication-based data
storage and distribution protocols to be modified to accept erasure-coded data. Second, this dissertation
proposes a correctness condition in the presence of faulty clients that allows reasoning about
concurrent Byzantine-tolerant objects in terms of their sequential specification and encapsulates the
semantics of the protocol in the presence of faulty clients.

Third, this dissertation introduces several techniques to improve the performance of Byzantine
fault-tolerant erasure-coded storage systems, and it provides a protocol that substantially
outperforms prior approaches. For example, the partial encoding optimization eliminates about half of the
necessary erasure coding in a distributed storage system. A prototype implementation demonstrates
experimentally that Byzantine fault-tolerant erasure-coded block storage protocols can provide
throughput and latency similar to those of protocols that tolerate only crashes.

Fourth, this dissertation proposes a new technique for building Byzantine fault-tolerant
replicated state machines, Byzantine Locking. Byzantine Locking provides unprecedented scalability
and efficiency for the common case of infrequent concurrent data sharing. Byzantine Locking is
used to build Zzyzx, a Byzantine fault-tolerant replicated state machine prototype that
substantially outperforms prior approaches. Experiments with Zzyzx demonstrate that Byzantine
fault-tolerant replicated state machines need only the minimal number of responsive servers to ensure
high throughput, provide single roundtrip latency, and provide scalability through workload
partitioning when the workload exhibits low object contention.

The next four sections provide an overview of these four contributions and the corresponding
chapter that describes each contribution in detail. Each of the chapters also discusses the background
and related work relevant to the corresponding contribution.

1.2 Verifying Distributed Erasure-Coded Data

Erasure coding can reduce the space and bandwidth overheads of redundancy in fault-tolerant data
storage and delivery systems. An m-of-n erasure code encodes a block of data into n fragments,
each 1/m the size of the original block, such that any m can be used to reconstruct the original
block. Thus, (n - m) of the fragments can be unavailable (e.g., due to corruption or server failure)
without loss of access. Erasure coding, however, introduces the fundamental difficulty of ensuring
that all erasure-coded fragments correspond to the same block of data. Without such assurance, a
different block may be reconstructed from different subsets of fragments. Previous systems in which
clients cannot be trusted to encode and distribute data correctly use one of two approaches. In the
first approach, servers are provided the entire block of data, allowing them to agree on the contents
and generate their own fragments [18, 19]. Savings are achieved for storage, but bandwidth overheads
are no better than for replication. In the second approach, clients verify all n fragments when they
perform a read to ensure that no other client could observe a different value [44], a significant
computational overhead.
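
The m-of-n property described above can be illustrated with a small sketch. It is not the code used in this dissertation: it assumes a toy polynomial-evaluation code over the prime field GF(257), and the names `P`, `encode`, `decode`, and `_mul_linear` are hypothetical. A block of m symbols is treated as the coefficients of a polynomial, each fragment is one evaluation of that polynomial (so each fragment is 1/m the size of the block), and any m fragments recover the block by interpolation.

```python
# Toy m-of-n erasure code over the prime field GF(P); illustration only.
P = 257  # hypothetical small prime; every symbol is an integer in [0, P)


def encode(block, n):
    """Treat the m symbols of `block` as polynomial coefficients and
    return n fragments (x, f(x)) for x = 1..n (requires n < P)."""
    fragments = []
    for x in range(1, n + 1):
        y = 0
        for coeff in reversed(block):  # Horner's rule
            y = (y * x + coeff) % P
        fragments.append((x, y))
    return fragments


def _mul_linear(poly, root):
    """Multiply poly (low-order coefficient first) by (x - root), mod P."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out


def decode(fragments, m):
    """Recover the m coefficients from any m fragments (Lagrange interpolation)."""
    pts = fragments[:m]
    result = [0] * m
    for i, (xi, yi) in enumerate(pts):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(pts):
            if j != i:
                basis = _mul_linear(basis, xj)
                denom = (denom * (xi - xj)) % P
        scale = yi * pow(denom, -1, P) % P  # modular inverse, Python 3.8+
        for k in range(m):
            result[k] = (result[k] + scale * basis[k]) % P
    return result
```

With m = 3 and n = 5, any two fragments may be missing. Nothing in this sketch stops a faulty client from distributing fragments that do not lie on a single polynomial, which is the consistency problem the text goes on to address.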

Chapter  2  proposes  homomorphic  fingerprinting,  a  new  technique  that  provides  this  assurance 
without  the  bandwidth  and  computational  overheads  associated  with  current  approaches.  The  core 
idea  is  to  distribute  homomorphic  fingerprints  with  each  fragment,  which  preserve  the  structure 
of  the  erasure  code  and  allow  each  fragment  to  be  independently  verified  as  corresponding  to  a 
specific  block.  The  key  insight  is  that  the  coding  scheme  imposes  certain  algebraic  constraints 
on  the  fragments,  and  that  there  exist  homomorphic  fingerprinting  functions  that  preserve  these 
constraints.  Chapter  2  presents  homomorphic  fingerprinting  functions  that  are  secure,  efficient,  and 
compact. 

1.3  Correctness  in  the  Presence  of  Faulty  Clients 

A distributed system is correct if it faithfully implements the desired protocol specification.
Unfortunately, providing correctness arguments is difficult in the presence of faulty clients that may
disobey the protocol. Because faulty clients can affect the state of the system by following the
protocol, their actions must be considered. But, because they may not follow the protocol, predicates
needed to prove correctness may not be well defined. There are several approaches to proving
correctness with faulty clients in the literature, but prior approaches prohibit some protocols that are
sufficiently correct for real applications or allow protocols that may not be sufficient for some real
applications.

Many  Byzantine  fault-tolerant  protocols  tolerate  Byzantine  faulty  clients  as  well  as  servers. 
Most  protocols  are  proven  to  guarantee  some  variant  of  linearizability  [51].  In  particular,  in  the 
absence  of  faulty  clients,  the  execution  of  such  protocols  should  result  in  a  linearizable  history  of 
events.  Unfortunately,  the  original  definition  of  linearizability  does  not  apply  in  the  presence  of 
faulty  clients,  and  there  is  no  agreed-upon  definition  that  applies  in  the  presence  of  faulty  clients. 
Chapter 3 proposes a minimal correctness condition, as well as two stronger conditions that allow
for easier analysis and provide useful guarantees. This definition will prove useful in Chapter 4,
which uses invocation criteria, one of the two stronger conditions, to reason about the correctness
of a Byzantine fault-tolerant storage protocol.

1.4 Low-Overhead Byzantine Fault-Tolerant Storage

Unlike replicated state machine protocols, which inherently require more servers to tolerate
Byzantine rather than crash faults, erasure-coded storage protocols can tolerate Byzantine faults with the
same number of servers used to tolerate crashes. Given an erasure code where m fragments are
required to reconstruct a block, tolerating f crash faults or Byzantine faults in an asynchronous
environment requires writing fragments to m + f servers (assuming m > f) out of m + 2f total servers.
Real-world storage systems have recently begun to tolerate faults other than crashes, but it is
unclear which faults such systems should tolerate. Thus, the Byzantine fault model would be of great
interest to the distributed storage community, if shown to be sufficiently efficient.

Chapter 4 presents an erasure-coded Byzantine fault-tolerant block storage protocol that is
nearly as efficient as protocols that tolerate only crashes. Previous Byzantine fault-tolerant block
storage protocols have either relied upon replication, which is inefficient for large blocks of data
when tolerating multiple faults, or a combination of additional servers, extra computation, and
versioned storage. To avoid these expensive techniques, the protocol employs novel mechanisms to
optimize for the common case when faults and concurrency are rare. In the common case, a write
operation completes in two rounds of communication and a read completes in one round. The
protocol requires a short checksum comprised of cryptographic hashes and homomorphic fingerprints.
It achieves throughput within 10% of the crash-tolerant protocol for writes and reads in failure-free
runs when configured to tolerate up to 6 faulty servers and any number of faulty clients.

1.5  Scalable  Fault  Tolerance  through  Byzantine  Locking 

Chapter 5 presents Zzyzx, a Byzantine fault-tolerant replicated state machine that outperforms prior
approaches and provides an unprecedented feature: near-linear scaling of throughput by adding
servers. Using a new technique called Byzantine Locking, Zzyzx allows a client to extract state
from an underlying replicated state machine and access it via a second protocol specialized for use
by a single client. This second protocol requires just one round-trip and 2f + 1 responsive servers.
Compared to Zyzzyva, this results in 39-43% lower response times and a factor of 2.2-2.9x higher
throughput. More importantly, the extracted state can be transferred to other servers, allowing
non-overlapping sets of servers to manage different state. Thus, Zzyzx allows throughput to be scaled
by adding servers when concurrent data sharing is not common. When data sharing is common,
performance can match that of the underlying replicated state machine protocol (e.g., Zyzzyva).

Byzantine fault-tolerant replicated state machine protocols require 3f + 1 servers in an
asynchronous network environment, of which 2f + 1 must be responsive and involved in each request.
Of recent proposals, only H/Q [32] and PBFT [24] achieve this minimum, while Q/U [2] requires
4f + 1 servers to be responsive and Zyzzyva [55] requires 3f + 1 servers to be responsive. The
ideal metadata protocol would require only 2f + 1 responsive replicas, as in H/Q and PBFT, but
could complete requests in as few roundtrips as possible in the common case, as in Q/U or Zyzzyva.
Zzyzx achieves both goals when contention is rare, which is the case for many important workloads,
such as metadata (e.g., see GPFS [92]). A detailed proof of the correctness of Byzantine Locking
in general, and Zzyzx in particular, is provided in Appendix A.
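
The responsive-server requirements quoted above are easy to tabulate. The following sketch only restates the counts given in the text; the function names are hypothetical, and the choice of Zyzzyva as the comparison point for the savings calculation is an assumption made for illustration.

```python
from fractions import Fraction


def responsive_servers(f):
    """Responsive servers needed per request to tolerate f faults, per the text."""
    return {
        "Zzyzx": 2 * f + 1,
        "PBFT": 2 * f + 1,
        "H/Q": 2 * f + 1,
        "Zyzzyva": 3 * f + 1,
        "Q/U": 4 * f + 1,
    }


def savings_vs_zyzzyva(f):
    """Fraction of responsive servers saved by 2f + 1 relative to 3f + 1."""
    counts = responsive_servers(f)
    return 1 - Fraction(counts["Zzyzx"], counts["Zyzzyva"])
```

The saving simplifies to f / (3f + 1), which is 25% at f = 1 and approaches, but never reaches, one third as f grows. Under the stated assumption this is consistent with the 25-33% range quoted in the Abstract.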




Chapter  2 

Homomorphic  Fingerprinting:  Verifying 
Distributed  Erasure-Coded  Data 


Erasure coding can reduce the space and bandwidth overheads of redundancy in fault-tolerant data
storage and delivery systems. An m-of-n erasure code encodes a block of data into n fragments, each
1/m the size of the original block, such that any m can be used to reconstruct the original block.
Thus, (n - m) of the fragments can be unavailable (e.g., due to corruption or server failure)
without loss of access. Example erasure coding schemes with these properties include Reed-Solomon
codes [85] and Rabin’s Information Dispersal Algorithm [84].

Unfortunately, erasure coding creates a fundamental challenge: determining if a given fragment
indeed corresponds to a specific original block. If this is not ensured for each fragment, then
reconstructing from different subsets of fragments may result in different blocks, violating any reasonable
definition of data consistency.

Systems in which clients cannot be trusted to encode and distribute data correctly use one of two
approaches. In the first approach, servers are provided the entire block of data, allowing them to
agree on the contents and generate their own fragments [18, 19]. Savings are achieved for storage,
but bandwidth overheads are no better than for replication. In the second approach, clients verify
all n fragments when they perform a read to ensure that no other client could observe a different
value [44]. In this approach, each fragment is accompanied by a cross-checksum [43, 58], which
consists of a hash of each of the n fragments. A reader verifies the cross-checksum by
reconstructing a block from m fragments and then recomputing the other (n - m) fragments and comparing
their hash values to the corresponding entries in the cross-checksum, a significant computational
overhead.
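
The read-time check in the second approach can be sketched as follows. This is an illustration under stated assumptions, not the protocol of [44]: it uses a toy 2-of-3 code (two halves of the block plus their XOR parity) and SHA-256 as the hash, and the names `encode`, `decode`, `cross_checksum`, and `reader_verify` are hypothetical.

```python
import hashlib


def _xor(u, v):
    return bytes(x ^ y for x, y in zip(u, v))


def encode(block):
    """Toy 2-of-3 code: the two halves of the block plus their XOR parity.
    The block length must be even."""
    half = len(block) // 2
    a, b = block[:half], block[half:]
    return [a, b, _xor(a, b)]


def decode(known):
    """Rebuild the block from any two fragments, given as {index: fragment}."""
    if 0 in known and 1 in known:
        a, b = known[0], known[1]
    elif 0 in known:
        a = known[0]
        b = _xor(a, known[2])
    else:
        b = known[1]
        a = _xor(b, known[2])
    return a + b


def cross_checksum(fragments):
    """One hash per fragment, in fragment order."""
    return [hashlib.sha256(frag).digest() for frag in fragments]


def reader_verify(known, checksum):
    """Reconstruct from m fragments, regenerate all n, and compare every hash.
    Returns the block, or None if the fragments were not a consistent encoding."""
    block = decode(known)
    if cross_checksum(encode(block)) != checksum:
        return None
    return block
```

The cost the text refers to is visible in `reader_verify`: every read re-encodes the whole block and hashes all n fragments, even though only m of them were fetched.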

This chapter develops a new approach, in which each fragment is accompanied by a set of
fingerprints that allows each server to independently verify that its fragment was generated from
the original value. The key insight is that the coding scheme imposes certain algebraic constraints
on the fragments, and that there exist homomorphic fingerprinting functions that preserve these
constraints. Servers can verify the integrity of the erasure coding as evidenced by the fingerprints,
agreeing upon a particular set of encoded fragments without ever needing to see them. Thus, the
two common approaches described above could be used without the bandwidth or computation
overheads, respectively.

The proposed fingerprinting functions belong to a family of universal hash functions [20],
chosen to preserve the underlying algebraic constraints of the fragments. A particular fingerprinting
function is chosen at random with respect to the fragments being fingerprinted. This “random”
selection can be deterministic with the appropriate application of a cryptographic hash function [12].
If data is represented carefully, either the remainder from division by a random irreducible
polynomial [83] or the evaluation of a polynomial at a random point preserves the needed algebraic
structure. The resulting fingerprints are secure, efficient, and compact.
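
The idea of evaluating a polynomial at a point chosen by a hash can be sketched in a few lines. This is a simplified stand-in, not the construction defined later in the chapter: it assumes the prime field of order 2^61 - 1 instead of the fields used in the text, derives the point by hashing all fragments with SHA-256 (a hypothetical scheme), and the names `P`, `fingerprint`, and `derive_point` are illustrative.

```python
import hashlib

P = 2**61 - 1  # a Mersenne prime; an assumed stand-in for the field in the text


def fingerprint(r, d):
    """Evaluate the polynomial with coefficient vector d at the point r, mod P."""
    acc = 0
    for coeff in reversed(d):  # Horner's rule
        acc = (acc * r + coeff) % P
    return acc


def derive_point(fragments):
    """Choose r deterministically by hashing every symbol of every fragment."""
    h = hashlib.sha256()
    for frag in fragments:
        for symbol in frag:
            h.update(symbol.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big") % P


d1 = [3, 1, 4, 1, 5]
d2 = [9, 2, 6, 5, 3]
# A third fragment formed as a linear combination of the first two.
parity = [(2 * a + 7 * b) % P for a, b in zip(d1, d2)]

r = derive_point([d1, d2, parity])
lhs = fingerprint(r, parity)
rhs = (2 * fingerprint(r, d1) + 7 * fingerprint(r, d2)) % P
```

Because polynomial evaluation is linear in the coefficients, `lhs` equals `rhs` for any choice of `r`: the fingerprint of a linear combination of vectors is the same linear combination of their fingerprints, which is the kind of algebraic structure the paragraph above refers to.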

The rest of this chapter is organized as follows. Section 2.1 provides a formal definition of
homomorphic fingerprinting along with two such functions, the division fingerprinting and the
evaluation fingerprinting functions. Section 2.2 describes a data structure called a fingerprinted
cross-checksum, against which the integrity of a fragment can be verified. Section 2.3 demonstrates how
homomorphic fingerprinting can improve distributed protocols by improving the bandwidth
overhead of the AVID protocol [18]. Section 2.4 measures the performance of this approach. Section 2.5
considers other protocols, and Section 2.6 surveys related work.


2.1  Homomorphic  Fingerprinting 

This  section  defines  homomorphic  fingerprinting  and  its  applications  to  erasure  codes.  Section  2.1.1 
defines  fingerprinting,  providing  two  examples:  division  and  evaluation  fingerprinting.  Section  2.1.2 
defines  homomorphic  fingerprinting  and  shows  that  both  division  and  evaluation  fingerprinting  are 
homomorphic  fingerprinting  functions.  Section  2.1.3  explains  the  applications  of  homomorphic 
fingerprinting  functions  to  erasure  codes. 

Throughout this chapter, let $\mathbb{F}$ denote a finite field with operators “+” and “·”, and let
$\mathbb{F}_{q^k}$ denote such a field of order $q^k$, where $q$ is prime. Let $r \leftarrow \mathcal{K}$
denote selection of an element from $\mathcal{K}$ uniformly at random and its assignment to $r$.


2.1.1  Fingerprinting 

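Definition 2.1.1. An $\varepsilon$-fingerprinting function
$\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}_{q^k}^{\delta} \to \mathbb{F}_{q^k}^{\gamma}$ satisfies

\[
\max_{\substack{d, d' \in \mathbb{F}_{q^k}^{\delta} \\ d \neq d'}}
\Pr_{r \leftarrow \mathcal{K}}\left[ \mathrm{fingerprint}(r, d) = \mathrm{fingerprint}(r, d') \right] \le \varepsilon
\]

In words, the probability under random selection of $r$ that
$\mathrm{fingerprint}(r, d) = \mathrm{fingerprint}(r, d')$ is at most $\varepsilon$.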

Let $\mathbb{F}_{q^k}[x]$ denote the set of polynomials with coefficients in $\mathbb{F}_{q^k}$, with “+” and “·”
defined as in normal polynomial arithmetic. A vector $d \in \mathbb{F}_{q^k}^{\delta}$ of $\delta$ elements of
$\mathbb{F}_{q^k}$ has a natural representation as a polynomial $d(x) \in \mathbb{F}_{q^k}[x]$ of degree less than
$\delta$ with coefficients in $\mathbb{F}_{q^k}$, where the $j$-th element of $d$ is the coefficient in $d(x)$ of
degree $j$, where $0 \le j < \delta$. These notations will be used interchangeably,
denoting $d$ as $d(x)$ when it assumes this form.

Example 2.1.2. [Rabin fingerprinting] Let $\mathbb{F}_2$ denote a field of order 2, let
$\mathcal{K} = \{2, 3, 4, \ldots, 2^{[\ldots]}\}$, and let $P_2 : \mathcal{K} \to \mathbb{F}_2[x]$ be a deterministic
algorithm that outputs monic irreducible polynomials of prime degree $\gamma$ with coefficients in
$\mathbb{F}_2$ such that

\[
\Pr\left[ p(x) = P_2(r) \right] = \Pr\left[ p'(x) = P_2(r) \right]
\]

for all $p(x), p'(x) \in \mathbb{F}_2[x]$ of degree $\gamma$. That is, $P_2$ selects monic degree-$\gamma$ irreducible
polynomials uniformly at random, where probabilities are taken with respect to the uniformly random
selection of $r$. Rabin showed that
$\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}_2^{\delta} \to \mathbb{F}_2^{\gamma}$ defined by

\[
\mathrm{fingerprint}(r, d(x)) :\quad p(x) \leftarrow P_2(r);\quad \text{return } \left( d(x) \bmod p(x) \right)
\]

is an $\varepsilon$-fingerprinting function for $\varepsilon = [\ldots]$ [83].
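
The remainder computation in Example 2.1.2 can be sketched directly. The sketch below makes two simplifications that are not part of the example: it fixes one irreducible polynomial, x^7 + x + 1 (of prime degree 7), instead of selecting one at random through P2, and it represents a polynomial over the field of order 2 as a Python integer whose bit i is the coefficient of x^i. The names `poly_mod`, `rabin_fingerprint`, and `P_X` are hypothetical.

```python
def poly_mod(d, p):
    """Remainder of d(x) divided by p(x) over the field of order 2.
    Polynomials are ints; bit i holds the coefficient of x^i."""
    deg_p = p.bit_length() - 1
    while d.bit_length() - 1 >= deg_p:
        shift = d.bit_length() - 1 - deg_p
        d ^= p << shift  # subtract (XOR) a shifted copy of p to clear the top bit
    return d


P_X = 0b10000011  # x^7 + x + 1; fixed here rather than chosen at random


def rabin_fingerprint(data, p=P_X):
    """Fingerprint a byte string by reading it as one large polynomial."""
    return poly_mod(int.from_bytes(data, "big"), p)


a = int.from_bytes(b"fragment-one", "big")
b = int.from_bytes(b"fragment-two", "big")
# Addition of polynomials over the field of order 2 is bitwise XOR.
sum_then_fp = poly_mod(a ^ b, P_X)
fp_then_sum = poly_mod(a, P_X) ^ poly_mod(b, P_X)
```

Here `sum_then_fp` equals `fp_then_sum`, since taking a remainder modulo a fixed polynomial is a linear operation. With a degree-7 modulus the fingerprint is only 7 bits, far too short for real use; the small degree is chosen only to keep the example readable.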

Theorem 2.1.3. [Division fingerprinting] Let $\mathbb{F}_{q^k}$ denote a field of order $q^k$, let the size of $\mathcal{K}$ be the number of monic irreducible polynomials of degree $\gamma$ with coefficients in $\mathbb{F}_{q^k}$, and let $P_{q^k} : \mathcal{K} \to \mathbb{F}_{q^k}[x]$ be a deterministic algorithm that outputs monic irreducible polynomials of degree $\gamma$ with coefficients in $\mathbb{F}_{q^k}$ chosen uniformly at random, with probabilities taken over the choice of input $r \in \mathcal{K}$ uniformly at random. Then $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}_{q^k}^{\delta} \to \mathbb{F}_{q^k}^{\gamma}$ defined by

\[ \mathrm{fingerprint}(r, d(x)) :\quad p(x) \leftarrow P_{q^k}(r);\quad \textbf{return}\ (d(x) \bmod p(x)) \]

is  an  s-fingerprinting  function  for  s  =  — 


Proof.  As  in  [83],  this  is  because  there  are  at  least  ^ monic  degree-y  irreducible  polynomials 
with coefficients in $\mathbb{F}_{q^k}$ [102], of which any nonzero degree-$\delta$ polynomial with coefficients in $\mathbb{F}_{q^k}$ may have at most $\delta/\gamma$ factors of degree $\gamma$. Consider the difference of any two distinct polynomials with matching fingerprints. Let $d(x), d'(x) \in \mathbb{F}_{q^k}[x]$ and $d(x) = d'(x) \bmod p(x)$ but $d(x) \ne d'(x)$. Then $(d(x) - d'(x)) = 0 \bmod p(x)$, so $p(x)$ is a factor of $(d(x) - d'(x))$. Because $d(x) \ne d'(x)$, $(d(x) - d'(x)) \ne 0$. But there are at most $\delta/\gamma$ different monic degree-$\gamma$ irreducible polynomials with coefficients in $\mathbb{F}_{q^k}$ that are factors of a nonzero degree-$\delta$ polynomial $(d(x) - d'(x))$. Hence, the
probability  that  p{x)  is  one  of  these  polynomials  is  at  most  — ^—iq-  n 

Division fingerprinting is a generalization of Rabin fingerprinting. Both are fast due to fast implementations of $P_2$ [83] and $P_{q^k}$ [94].
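The division step can be sketched in a few lines of Python for the binary case. This is a minimal illustration under stated assumptions, not the implementation referenced above: polynomials over $\mathbb{F}_2$ are stored as integers (bit $i$ is the coefficient of $x^i$), the modulus is passed in directly rather than derived from $r$ by $P_2$, and the fixed modulus `AES_POLY` (a well-known irreducible polynomial of degree 8) is chosen only for demonstration.

```python
import random

AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2); illustrative choice

def poly_mod(d: int, p: int) -> int:
    """Remainder of d(x) divided by p(x); bit i of each int is the coefficient of x^i."""
    deg_p = p.bit_length() - 1
    while d.bit_length() - 1 >= deg_p:
        # Cancel the leading term of d(x) with a shifted copy of p(x).
        d ^= p << (d.bit_length() - 1 - deg_p)
    return d

def fingerprint(p: int, d: int) -> int:
    """Division fingerprint of d(x) under the modulus p(x)."""
    return poly_mod(d, p)

rng = random.Random(0)
d1, d2 = rng.getrandbits(256), rng.getrandbits(256)

# Addition in GF(2)[x] is XOR, so the fingerprint of a sum is the XOR of fingerprints.
homomorphic = fingerprint(AES_POLY, d1 ^ d2) == (
    fingerprint(AES_POLY, d1) ^ fingerprint(AES_POLY, d2))

# Two inputs collide exactly when p(x) divides their difference.
collides = fingerprint(AES_POLY, d1) == fingerprint(AES_POLY, d1 ^ (AES_POLY << 5))
```

The second check shows why the modulus must be chosen at random: anyone who knows $p(x)$ can construct a colliding input by adding a multiple of it.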

Let $\mathbb{E}_{q^{k\gamma}} = \mathbb{F}_{q^k}[x]/p(x)$ denote the extension field of polynomials with coefficients in $\mathbb{F}_{q^k}$ of degree less than $\gamma$, with "+" defined as normal and "·" defined modulo a constant monic degree-$\gamma$ irreducible polynomial $p(x) \in \mathbb{F}_{q^k}[x]$. Let $\mathbb{E}_{q^{k\gamma}}[y]$ denote the set of polynomials with coefficients in $\mathbb{E}_{q^{k\gamma}}$, with "+" and "·" defined as normal. It is convenient to consider $d \in \mathbb{E}_{q^{k\gamma}}[y]$ as a polynomial in two variables, $d(y, x)$.

A vector $d \in \mathbb{F}_{q^k}^{\delta}$ of $\delta$ elements of $\mathbb{F}_{q^k}$ has a natural representation as a polynomial $d(y, x) \in \mathbb{E}_{q^{k\gamma}}[y]$ of degree less than $\delta/\gamma$ in variable $y$. The $j$-th element of $d$ is the coefficient in $d(y, x)$ of degree $\lfloor j/\gamma \rfloor$ in variable $y$ and degree $j \bmod \gamma$ in variable $x$. These notations will be used interchangeably, denoting $d$ as $d(y, x)$ when it assumes this form.

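Theorem 2.1.4. [Evaluation fingerprinting] Let $\mathbb{E}_{q^{k\gamma}} = \mathbb{F}_{q^k}[x]/p(x)$ denote a field of polynomials with coefficients in $\mathbb{F}_{q^k}$ of degree less than $\gamma$ with "·" defined modulo $p(x)$, a constant monic degree-$\gamma$ irreducible polynomial. Let $\mathcal{K} = \{0, \ldots, q^{k\gamma} - 1\}$, and let $S : \mathcal{K} \to \mathbb{E}_{q^{k\gamma}}$ be a deterministic algorithm that outputs an element of $\mathbb{E}_{q^{k\gamma}}$ chosen uniformly at random, with probabilities taken over the choice of input $r \in \mathcal{K}$ uniformly at random. Then the function $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}_{q^k}^{\delta} \to \mathbb{F}_{q^k}^{\gamma}$ defined by

\[ \mathrm{fingerprint}(r, d(y, x)) :\quad s(x) \leftarrow S(r);\quad \textbf{return}\ d(s(x), x) \]

is an $\varepsilon$-fingerprinting function for $\varepsilon = \frac{\delta/\gamma}{q^{k\gamma}}$.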


Proof. As in [75], this is because any $\delta/\gamma$ points fully determine a polynomial of degree less than $\delta/\gamma$ over a field. Hence, any two distinct polynomials of degree less than $\delta/\gamma$ share fewer than $\delta/\gamma$ points. Because there are $q^{k\gamma}$ different points in $\mathbb{E}_{q^{k\gamma}}$, the probability that a randomly chosen point is shared between two distinct polynomials is at most $\frac{\delta/\gamma}{q^{k\gamma}}$. □

A trivial implementation of $S$ is to return the polynomial representation of $r$ divided into $\gamma$ coefficients, where each coefficient is an element of $\mathbb{F}_{q^k}$.

Variants of division and evaluation fingerprinting known as the division and evaluation hashes can be used for message authentication. They are two of the fastest hashes, producing the smallest output and requiring the fewest bits of random input [79].
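A simplified instance of evaluation fingerprinting may make the collision bound concrete. The sketch below assumes the degenerate case $\gamma = 1$ over a prime field (integers modulo a prime `p`), so the evaluation point is just the key $r$ itself; the names `evaluation_fingerprint` and `collisions` are illustrative and do not come from the text.

```python
P = 2**61 - 1  # a Mersenne prime, so the integers mod P form a field

def evaluation_fingerprint(r, d, p=P):
    """Evaluate d(y) = sum_j d[j] * y**j at y = r (mod p) using Horner's rule."""
    fp = 0
    for coeff in reversed(d):
        fp = (fp * r + coeff) % p
    return fp

def collisions(d1, d2, p):
    """Count the keys r in {0, ..., p-1} under which d1 and d2 share a fingerprint."""
    return sum(evaluation_fingerprint(r, d1, p) == evaluation_fingerprint(r, d2, p)
               for r in range(p))
```

For two distinct vectors of length $\delta$, `collisions` never exceeds $\delta - 1$, since the difference of the two polynomials has at most that many roots; dividing by the number of keys gives the bound of the theorem for this special case.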

2.1.2  Homomorphism 

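Throughout this chapter, let $b \cdot d$ denote the application of "·" by a scalar $b \in \mathbb{F}$ to each element in a vector $d \in \mathbb{F}^{\delta}$ of $\delta$ elements of $\mathbb{F}$.

Definition 2.1.5. A fingerprinting function $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}^{\delta} \to \mathbb{F}^{\gamma}$ is homomorphic if $\mathrm{fingerprint}(r, d) + \mathrm{fingerprint}(r, d') = \mathrm{fingerprint}(r, d + d')$ and $b \cdot \mathrm{fingerprint}(r, d) = \mathrm{fingerprint}(r, b \cdot d)$ for any $r \in \mathcal{K}$ and any $d, d' \in \mathbb{F}^{\delta}$, $b \in \mathbb{F}$.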

Theorem  2.1.6.  The  fingerprinting  functions  given  in  Example  2.1.2  and  Theorem  2.1.3  are 
homomorphic. 


Proof. For any $d, d' \in \mathbb{F}_{q^k}^{\delta}$ and any $r \in \mathcal{K}$, $p(x) \leftarrow P_{q^k}(r)$,

\begin{align*}
\mathrm{fingerprint}(r, d(x)) + \mathrm{fingerprint}(r, d'(x))
  &= d(x) \bmod p(x) + d'(x) \bmod p(x) \\
  &= (d(x) + d'(x)) \bmod p(x) \\
  &= \mathrm{fingerprint}(r, d(x) + d'(x))
\end{align*}

Moreover, for any $b \in \mathbb{F}_{q^k}$,

\begin{align*}
\mathrm{fingerprint}(r, b \cdot d(x))
  &= (b \cdot d(x)) \bmod p(x) \\
  &= b \cdot (d(x) \bmod p(x)) \\
  &= b \cdot \mathrm{fingerprint}(r, d(x))
\end{align*}

□


Theorem  2.1.7.  The  fingerprinting  function  given  in  Theorem  2.1.4  is  homomorphic. 


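Proof. For any $d, d' \in \mathbb{F}_{q^k}^{\delta}$ and any $r \in \mathcal{K}$, $s(x) \leftarrow S(r)$,

\begin{align*}
\mathrm{fingerprint}(r, d(y, x)) + \mathrm{fingerprint}(r, d'(y, x))
  &= d(s(x), x) + d'(s(x), x) \\
  &= (d + d')(s(x), x) \\
  &= \mathrm{fingerprint}(r, d + d')
\end{align*}

Moreover, for any $b \in \mathbb{F}_{q^k}$,

\begin{align*}
\mathrm{fingerprint}(r, b \cdot d)
  &= \mathrm{fingerprint}(r, (b \cdot d)(y, x)) \\
  &= (b \cdot d)(s(x), x) \\
  &= b \cdot (d(s(x), x)) \\
  &= b \cdot \mathrm{fingerprint}(r, d(y, x))
\end{align*}

□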


The  following  lemma  restates  the  properties  of  a  homomorphic  fingerprinting  function. 

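Lemma 2.1.8. Let $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}^{\delta} \to \mathbb{F}^{\gamma}$ denote a homomorphic $\varepsilon$-fingerprinting function. For any fixed constants $b_i \in \mathbb{F}$, $1 \le i \le m$,

\[
\max_{\substack{d, d_1, \ldots, d_m \in \mathbb{F}^{\delta} \\ d \ne \sum_{i=1}^{m} b_i \cdot d_i}}
\Pr\left[ \mathrm{fingerprint}(r, d) = \sum_{i=1}^{m} b_i \cdot \mathrm{fingerprint}(r, d_i) : r \xleftarrow{R} \mathcal{K} \right] \le \varepsilon
\]

Proof. Suppose otherwise. That is, suppose that there are $d, d', d_1, \ldots, d_m \in \mathbb{F}^{\delta}$ such that $d \ne d' = \sum_{i=1}^{m} b_i \cdot d_i$ and

\[
\Pr\left[ \mathrm{fingerprint}(r, d) = \sum_{i=1}^{m} b_i \cdot \mathrm{fingerprint}(r, d_i) : r \xleftarrow{R} \mathcal{K} \right] > \varepsilon
\]

By homomorphism, for any $r \in \mathcal{K}$,

\[
\sum_{i=1}^{m} b_i \cdot \mathrm{fingerprint}(r, d_i)
= \sum_{i=1}^{m} \mathrm{fingerprint}(r, b_i \cdot d_i)
= \mathrm{fingerprint}\Big(r, \sum_{i=1}^{m} b_i \cdot d_i\Big)
= \mathrm{fingerprint}(r, d')
\]

Then

\[
\Pr\left[ \mathrm{fingerprint}(r, d) = \mathrm{fingerprint}(r, d') : r \xleftarrow{R} \mathcal{K} \right] > \varepsilon
\]

in violation of Definition 2.1.1. □

Let $\mathrm{encode}(B_j)$ output $d_{j1}, \ldots, d_{jn}$. Then $[\mathrm{encode}(B_1); \mathrm{encode}(B_2); \ldots; \mathrm{encode}(B_{\delta})]$ outputs

\[
\begin{matrix}
d_{11} & \cdots & d_{1n} \\
d_{21} & \cdots & d_{2n} \\
\vdots &        & \vdots \\
d_{\delta 1} & \cdots & d_{\delta n}
\end{matrix}
\]

Let $d_i = (d_{1i}, d_{2i}, \ldots, d_{\delta i})$. That is, each $d_i$ is a column vector from above. Then $\mathrm{encode}^{\delta}(B)$ outputs $d_1, \ldots, d_n$.

Figure 2.1.1: Encoding a vector.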


2.1.3  Applications  to  Erasure  Codes 

Definition 2.1.9. An m-of-n erasure coding scheme is a pair of deterministic algorithms (encode, decode), where $\mathrm{encode} : \mathbb{F}^{m} \to \mathbb{F}^{n}$ and $\mathrm{decode} : (\mathbb{F} \times \{1, \ldots, n\})^{m} \to \mathbb{F}^{m}$. If $d_1, \ldots, d_n \leftarrow \mathrm{encode}(B)$, then $\mathrm{decode}(d_{i_1}, \ldots, d_{i_m}) = B$ for any distinct $i_1, \ldots, i_m$ ($1 \le i_j \le n$).

Each fragment provided to decode is accompanied by its index $i \in \{1, \ldots, n\}$. For notational simplicity, let each index be implicitly provided to decode.

Definition 2.1.10. An m-of-n erasure coding scheme (encode, decode) is linear if there exist fixed constants $b_{ij} \in \mathbb{F}$ for each $1 \le i \le n$ and $1 \le j \le m$ such that for any $B \in \mathbb{F}^{m}$, if $d_1, \ldots, d_n \leftarrow \mathrm{encode}(B)$ then $d_i = \sum_{j=1}^{m} b_{ij} \cdot d_j$.

Examples  of  linear  erasure  coding  schemes  are  Reed-Solomon  codes  [85]  and  Rabin’s  Information 
Dispersal  Algorithm  [84]. 
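As a concrete reading of Definitions 2.1.9 and 2.1.10, the sketch below builds a small linear m-of-n code by polynomial interpolation over a prime field. It is an assumption-laden illustration rather than any of the cited codes: the modulus `P`, the evaluation points $1, \ldots, n$, and the helper names are all chosen for the example, and the first $m$ fragments equal the block itself so that each fragment is a fixed linear combination of $d_1, \ldots, d_m$.

```python
P = 257  # small prime; arithmetic mod P stands in for the field F

def lagrange_coeff(points, j, x):
    """Weight of the value at points[j] when interpolating at position x."""
    num, den = 1, 1
    for t, xt in enumerate(points):
        if t != j:
            num = num * (x - xt) % P
            den = den * (points[j] - xt) % P
    return num * pow(den, -1, P) % P  # modular inverse needs Python 3.8+

def encode(block, n):
    """Fragment i (1-based) is sum_j b_ij * block[j], with b_ij a Lagrange weight."""
    m = len(block)
    pts = list(range(1, m + 1))
    return [sum(lagrange_coeff(pts, j, i) * block[j] for j in range(m)) % P
            for i in range(1, n + 1)]

def decode(fragments, m):
    """Recover the block from any m (index, value) pairs by interpolating at 1..m."""
    idx = [i for i, _ in fragments[:m]]
    vals = [v for _, v in fragments[:m]]
    return [sum(lagrange_coeff(idx, t, x) * vals[t] for t in range(m)) % P
            for x in range(1, m + 1)]
```

For example, `encode([10, 20], 3)` extends the line through the points (1, 10) and (2, 20) to position 3, and any two of the three indexed fragments passed to `decode` return the original block.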

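The following three shorthands will be useful for the next theorem. First, to consider only the $i$-th encoded fragment, define the shorthand $d_i \leftarrow \mathrm{encode}_i(B)$. Second, abbreviate $\mathrm{encode}(\mathrm{decode}(d_{i_1}, \ldots, d_{i_m}))$ as $\mathrm{encode}(d_{i_1}, \ldots, d_{i_m})$. Third, to apply encode or decode to each of the $j$-th elements of $m$ $\delta$-length vectors for every $j \in \{1, \ldots, \delta\}$, define the shorthands $\mathrm{encode}^{\delta} : (\mathbb{F}^{\delta})^{m} \to (\mathbb{F}^{\delta})^{n}$ and $\mathrm{decode}^{\delta} : (\mathbb{F}^{\delta})^{m} \to (\mathbb{F}^{\delta})^{m}$. Then $d_1, \ldots, d_n \leftarrow \mathrm{encode}^{\delta}(B)$ and $B \leftarrow \mathrm{decode}^{\delta}(d_{i_1}, \ldots, d_{i_m})$, where $B \in (\mathbb{F}^{\delta})^{m}$ and $d_i \in \mathbb{F}^{\delta}$. See Figure 2.1.1 for illustration.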

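Theorem 2.1.11. Let $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}^{\delta} \to \mathbb{F}^{\gamma}$ be a homomorphic $\varepsilon$-fingerprinting function, and let (encode, decode) be a linear erasure code with coefficients $b_{ij} \in \mathbb{F}$, for $1 \le i \le n$ and $1 \le j \le m$. If $(d_1, \ldots, d_n) \leftarrow \mathrm{encode}^{\delta}(B)$, then for any $r \in \mathcal{K}$ and any $1 \le i \le n$,

\[ \mathrm{fingerprint}(r, d_i) = \mathrm{encode}_i^{\gamma}(\mathrm{fingerprint}(r, d_1), \ldots, \mathrm{fingerprint}(r, d_m)) \]

Proof.

\begin{align*}
\mathrm{fingerprint}(r, d_i)
  &= \mathrm{fingerprint}(r, \mathrm{encode}_i^{\delta}(B)) \\
  &= \mathrm{fingerprint}\Big(r, \sum_{j=1}^{m} b_{ij} \cdot d_j\Big) && \text{(by Definition 2.1.10)} \\
  &= \sum_{j=1}^{m} b_{ij} \cdot \mathrm{fingerprint}(r, d_j) && \text{(by Definition 2.1.5)} \\
  &= \mathrm{encode}_i^{\gamma}(\mathrm{fingerprint}(r, d_1), \ldots, \mathrm{fingerprint}(r, d_m)) && \text{(by Definition 2.1.10)}
\end{align*}

□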


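Corollary 2.1.12. Let $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}^{\delta} \to \mathbb{F}^{\gamma}$ be a homomorphic $\varepsilon$-fingerprinting function, and let (encode, decode) be a linear erasure code. If $(d_1, \ldots, d_n) \leftarrow \mathrm{encode}^{\delta}(B)$, then for any $d \ne d_i$,

\[
\Pr\left[ \mathrm{fingerprint}(r, d) = \mathrm{encode}_i^{\gamma}(\mathrm{fingerprint}(r, d_1), \ldots, \mathrm{fingerprint}(r, d_m)) : r \xleftarrow{R} \mathcal{K} \right] \le \varepsilon
\]

Proof. Follows from Theorem 2.1.11 and Lemma 2.1.8. □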

Theorem  2.1.11  and  Corollary  2.1.12  state  two  useful  facts  about  homomorphic  fingerprinting 
functions.  First,  the  fingerprints  from  an  encoding  of  a  block  are  equal  to  the  encoding  of  the 
fingerprints  of  the  block.  That  is,  homomorphic  fingerprinting  functions  are  homomorphic.  Second, 
if  the  fingerprint  of  a  fragment  is  equal  to  the  encoding  of  the  fingerprints  of  other  fragments,  the 
fragment  is,  with  high  probability,  the  encoding  of  the  other  fragments.  That  is,  homomorphic 
fingerprinting  functions  are  fingerprinting  functions. 
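The first of these facts can be checked numerically. The following sketch assumes a prime field (modulus `P`), an interpolation-based linear code whose coefficients `b(m, i, j)` are Lagrange weights, and a polynomial-evaluation fingerprint; all of these are stand-ins chosen for the example rather than the constructions used in the thesis.

```python
P = 2**61 - 1  # prime modulus standing in for the field F

def b(m, i, j):
    """Hypothetical code coefficient b_ij: Lagrange weight of data position j at i."""
    num, den = 1, 1
    for t in range(1, m + 1):
        if t != j:
            num = num * (i - t) % P
            den = den * (j - t) % P
    return num * pow(den, -1, P) % P

def fingerprint(r, d):
    """Evaluate the polynomial with coefficient vector d at the point r (Horner's rule)."""
    fp = 0
    for c in reversed(d):
        fp = (fp * r + c) % P
    return fp

def encode_vectors(block, n):
    """encode^delta: apply the code position by position to m vectors of equal length."""
    m = len(block)
    return [[sum(b(m, i, j) * block[j - 1][t] for j in range(1, m + 1)) % P
             for t in range(len(block[0]))]
            for i in range(1, n + 1)]

def encode_fingerprints(fps, i):
    """encode_i^gamma: the same linear combination applied to the m fingerprints."""
    m = len(fps)
    return sum(b(m, i, j) * fps[j - 1] for j in range(1, m + 1)) % P
```

Fingerprinting fragment $i$ directly and encoding the fingerprints of the first $m$ fragments give the same value for every key, which is what lets a server check its own fragment without seeing the others.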

2.2  Fingerprinted  Cross-checksum 

The fault-tolerant data storage example given in Section 2.3 utilizes a data structure called a
fingerprinted  cross-checksum.  Before  considering  the  contents  of  a  fingerprinted  cross-checksum, 
recall  the  following  definition  of  a  collision-resistant  hash  function  (e.g.,  see  [90]). 

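Definition 2.2.1. A family of hash functions $\{\mathrm{hash}_K : \{0,1\}^{*} \to \{0,1\}^{\lambda}\}_{K}$ is $(\tau, \varepsilon')$-collision resistant if for every probabilistic algorithm $A$ that runs in time $\tau$,

\[
\Pr\left[ d' \ne d \wedge \mathrm{hash}_K(d') = \mathrm{hash}_K(d) \right] \le \varepsilon'
\]

where the probability is taken over the random choice of the key $K$ and $(d, d') \leftarrow A(K)$.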

A  fingerprinted  cross-checksum  then  has  the  following  form. 

Definition 2.2.2. An m-of-n fingerprinted cross-checksum fpcc consists of an array fpcc.cc[] of $n$ values in $\{0,1\}^{\lambda}$ and an array fpcc.fp[] of $m$ values in $\mathbb{F}^{\gamma}$.

The name "fingerprinted cross-checksum" derives from the fact that the array fpcc.cc[] is a cross-checksum [43, 58] and because fpcc.fp[] holds homomorphic fingerprints.

Let $\mathrm{hash} : \{0,1\}^{*} \to \{0,1\}^{\lambda}$ denote a random instance of a $(\tau, \varepsilon')$-collision resistant hash function family, and let $\mathrm{fingerprint} : \mathcal{K} \times \mathbb{F}^{\delta} \to \mathbb{F}^{\gamma}$ be a homomorphic $\varepsilon$-fingerprinting function. Let $\mathrm{random\_oracle} : (\{0,1\}^{\lambda})^{n} \to \mathcal{K}$ denote a random oracle [12], which is a fixed, public function chosen uniformly at random from all functions from the same domain to the same range. The following definition specifies when a fragment is consistent with a fingerprinted cross-checksum.

Definition 2.2.3. Let fpcc be a fingerprinted cross-checksum. A fragment $d \in \mathbb{F}^{\delta}$ is consistent with fpcc for index $i$, $1 \le i \le n$, if

\[ \mathrm{fpcc.cc}[i] = \mathrm{hash}(d) \]

and

\[ \mathrm{fingerprint}(r, d) = \mathrm{encode}_i^{\gamma}(\mathrm{fpcc.fp}[1], \ldots, \mathrm{fpcc.fp}[m]) \]

where $r = \mathrm{random\_oracle}(\mathrm{fpcc.cc}[1], \ldots, \mathrm{fpcc.cc}[n])$.
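The consistency check of Definition 2.2.3 can be sketched as follows. Everything concrete here is an assumption made for illustration: SHA-256 from Python's `hashlib` plays the role of both hash and the random oracle (reduced modulo `P`, which introduces a slight bias), the field is the integers modulo a prime, the code coefficients `b(m, i, j)` are Lagrange weights, and the dictionary layout of `fpcc` is hypothetical.

```python
import hashlib

P = 2**61 - 1  # prime modulus standing in for the field F

def b(m, i, j):
    """Hypothetical code coefficient b_ij (Lagrange weight of position j at i)."""
    num, den = 1, 1
    for t in range(1, m + 1):
        if t != j:
            num = num * (i - t) % P
            den = den * (j - t) % P
    return num * pow(den, -1, P) % P

def fingerprint(r, d):
    """Polynomial-evaluation fingerprint of the vector d at the point r."""
    fp = 0
    for c in reversed(d):
        fp = (fp * r + c) % P
    return fp

def hash_fragment(d):
    """Stand-in for hash: SHA-256 over the fragment's 8-byte big-endian elements."""
    return hashlib.sha256(b"".join(v.to_bytes(8, "big") for v in d)).digest()

def random_oracle(cc):
    """Stand-in for the random oracle: derive the key r from all n fragment hashes."""
    return int.from_bytes(hashlib.sha256(b"".join(cc)).digest(), "big") % P

def make_fpcc(fragments, m):
    """Build a fingerprinted cross-checksum from all n fragments."""
    cc = [hash_fragment(d) for d in fragments]
    r = random_oracle(cc)
    return {"cc": cc, "fp": [fingerprint(r, d) for d in fragments[:m]]}

def is_consistent(fpcc, i, d):
    """Check fragment d against fpcc for the 1-based index i."""
    m = len(fpcc["fp"])
    r = random_oracle(fpcc["cc"])
    expected = sum(b(m, i, j) * fpcc["fp"][j - 1] for j in range(1, m + 1)) % P
    return fpcc["cc"][i - 1] == hash_fragment(d) and fingerprint(r, d) == expected
```

Note that `is_consistent` needs only the fragment being checked and the small `fpcc` structure, never the other fragments or the decoded block.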

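Theorem 2.2.4. Let $A$ be a probabilistic algorithm that runs in time $\tau$, makes $\chi$ queries to random_oracle, and produces an m-of-n fpcc and fragments $d_{i_1}, \ldots, d_{i_m}$ and $d'_{i'_1}, \ldots, d'_{i'_m}$ such that each fragment is consistent with fpcc for its index. If

\begin{align*}
B  &\leftarrow \mathrm{decode}^{\delta}(d_{i_1}, \ldots, d_{i_m}) \\
B' &\leftarrow \mathrm{decode}^{\delta}(d'_{i'_1}, \ldots, d'_{i'_m})
\end{align*}

then the probability that $B \ne B'$ is at most $\varepsilon' + M \cdot \varepsilon$ for constant $M = \chi \binom{n}{m+1}$.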

Proof. Suppose that $A$, running in time $\tau$, produces some fpcc and fragments $d_{i_1}, \ldots, d_{i_m}$ and $d'_{i'_1}, \ldots, d'_{i'_m}$, each consistent with fpcc for its index, such that if $B \leftarrow \mathrm{decode}^{\delta}(d_{i_1}, \ldots, d_{i_m})$ and $B' \leftarrow \mathrm{decode}^{\delta}(d'_{i'_1}, \ldots, d'_{i'_m})$ then $B \ne B'$. $B \ne B'$ implies that for some $j$, $1 \le j \le m$, $d_{i_j} \ne \mathrm{encode}_{i_j}^{\delta}(B')$. Yet, because each fragment is consistent with fpcc, for each $d_i \in \{d_{i_j}, d'_{i'_1}, \ldots, d'_{i'_m}\}$, by Definition 2.2.3

\[ \mathrm{fingerprint}(r, d_i) = \mathrm{encode}_i^{\gamma}(\mathrm{fpcc.fp}[1], \ldots, \mathrm{fpcc.fp}[m]) \]

where $r = \mathrm{random\_oracle}(\mathrm{fpcc.cc}[1], \ldots, \mathrm{fpcc.cc}[n])$. By Definition 2.1.9, this can be rearranged to

\[ \mathrm{fingerprint}(r, d_{i_j}) = \mathrm{encode}_{i_j}^{\gamma}(\mathrm{fingerprint}(r, d'_{i'_1}), \ldots, \mathrm{fingerprint}(r, d'_{i'_m})) \]

Bound the probability with which $A$ succeeds in producing such values as follows. First suppose that $A$ fails to produce a collision in hash. Then, for any random oracle query $\hat{r} \leftarrow \mathrm{random\_oracle}(h_1, \ldots, h_n)$, $A$ possesses at most one $\hat{d}_i$ such that $\mathrm{hash}(\hat{d}_i) = h_i$, for each $1 \le i \le n$. Of these $n$ fragments $\hat{d}_1, \ldots, \hat{d}_n$, consider each selection of $m + 1$ of them, $\hat{d}_{i_0}, \hat{d}_{i_1}, \ldots, \hat{d}_{i_m}$, such that $\hat{B} \leftarrow \mathrm{decode}^{\delta}(\hat{d}_{i_1}, \ldots, \hat{d}_{i_m})$ implies $\hat{d}_{i_0} \ne \mathrm{encode}_{i_0}^{\delta}(\hat{B})$. This selection satisfies $\mathrm{fingerprint}(\hat{r}, \hat{d}_{i_0}) = \mathrm{encode}_{i_0}^{\gamma}(\mathrm{fingerprint}(\hat{r}, \hat{d}_{i_1}), \ldots, \mathrm{fingerprint}(\hat{r}, \hat{d}_{i_m}))$ with probability at most $\varepsilon$, by Corollary 2.1.12. Since there are at most $\binom{n}{m+1}$ such selections per random oracle query, and since there are $\chi$ queries to the random oracle, the probability with which $A$ generates any such fragments without finding a collision in hash is at most $M \cdot \varepsilon$ where $M = \chi \binom{n}{m+1}$. Adding the probability $\varepsilon'$ that $A$ finds a collision in hash, the total probability of $A$'s success is bounded by $\varepsilon' + M \cdot \varepsilon$. □
2.3 Example: Improving AVID

This section illustrates how homomorphic fingerprinting can improve distributed protocols by modifying the AVID [18] protocol to make it more bandwidth efficient. Section 2.3.1 describes AVID. Section 2.3.2 highlights the proposed modifications. Section 2.3.3 provides a complete description along with pseudo-code of the modified protocol, AVID-FP. Section 2.3.4 proves that AVID-FP satisfies the functional specification of an asynchronous verifiable information dispersal protocol given in [18]. Both the AVID and AVID-FP protocols can be used to build a Byzantine fault-tolerant distributed storage system using only $3f + 1$ servers [19], where $f$ is an upper bound on the number of faulty servers.

This section assumes that there are $n$ servers and that a data block is erasure coded into fragments such that any $m$ fragments suffice to decode it, where $m \ge f + 1$ and $n = m + 2f$. The system model is similar to that in [18]; there are authenticated, reliable, asynchronous point-to-point communications channels between all servers and clients, and all servers and clients are computationally limited so as to be unable to break the utilized cryptographic primitives.

2.3.1 AVID

AVID [18] is an asynchronous verifiable information dispersal protocol. In such a protocol, a client disperses some block $B$, which can later be retrieved by any client. The verifiability of the protocol ensures that any two clients retrieve the same block after dispersal.

For simplicity, the description of AVID in this section is restricted to $m = f + 1$ and $n = 3f + 1$. To write a block, a client encodes it into fragments and computes the hash of every fragment, creating a cross-checksum. The client sends to each server its fragment and the cross-checksum. Each server then echoes the cross-checksum and its fragment to all other servers in an echo message. After receiving $2f + 1$ fragments and matching cross-checksums in echo messages, a correct server decodes the block, re-encodes it, and verifies each component of the cross-checksum, aborting if inconsistencies are found. A correct server then broadcasts this consistent cross-checksum and its fragment from the re-encoding to all other servers in a ready message. A correct server does likewise if it receives $f + 1$ ready messages before it receives $2f + 1$ echo messages. After receiving $2f + 1$ ready messages, a correct server can conclude that at least $f + 1$ correct servers broadcast ready messages that all correct servers will eventually receive. Hence, all correct servers will broadcast ready messages, and so all will receive at least $2f + 1$ such messages and reach this point. The server can then
reconstruct its fragment if needed and store this value. The bandwidth required to store block $B$ is then $O(n \cdot \frac{n}{m} |B|) = O(n|B|)$, assuming the cross-checksum is of negligible size.

To  read  a  block,  a  client  retrieves  a  fragment  and  cross-checksum  from  each  server  until  it 
finds a matching cross-checksum from $f + 1$ servers and $m$ fragments that are consistent with this
cross-checksum.  These  fragments  are  decoded  and  returned. 

2.3.2 AVID-FP

This section modifies AVID to utilize homomorphic fingerprinting, creating a new protocol, AVID-FP. AVID-FP differs from AVID in that servers agree upon a fingerprinted cross-checksum that
is consistent with a block rather than on the block itself; servers need not echo fragments. The bandwidth required to store block $B$ in AVID-FP is then $O(\frac{n}{m} |B|) = O(|B|)$, assuming a fingerprinted cross-checksum is of negligible size.
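A back-of-the-envelope count makes the difference concrete. The sketch below is a rough tally derived from the protocol descriptions in this section, not a measurement: it assumes each fragment has size $|B|/m$, that every AVID echo and ready message carries one fragment, and that cross-checksums cost nothing. The function names are illustrative.

```python
def avid_bytes(block_size, n, m):
    """Approximate fragment traffic in AVID: dispersal plus n*n echoes and n*n readies."""
    frag = block_size / m
    return n * frag + 2 * n * n * frag

def avid_fp_bytes(block_size, n, m):
    """Approximate fragment traffic in AVID-FP: only the client's n fragments."""
    return n * (block_size / m)

# Example: f = 1, m = f + 1 = 2, n = 3f + 1 = 4, and a block of size 1 (any unit).
avid_total = avid_bytes(1.0, 4, 2)
avid_fp_total = avid_fp_bytes(1.0, 4, 2)
```

Under these assumptions the example yields 18 block-sizes of traffic for AVID against 2 for AVID-FP, and the gap widens as $n$ grows.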

In AVID-FP, each cross-checksum is replaced by the fingerprinted cross-checksum from Section 2.2. Unlike a cross-checksum, a server can verify with a fingerprinted cross-checksum that its fragment corresponds to a unique block without knowing the entire block. As a consequence, there is no need to send a fragment along with each echo or ready message, which saves substantial bandwidth. Furthermore, a server has nothing to re-encode and verify upon receiving an echo or ready message, saving a substantial amount of computation.

A less welcome consequence is that a correct server cannot reconstruct its fragment if it is not provided by the client. This is not a problem, however, because a server can still verify that enough other correct servers received consistent fragments such that a consistent block will always be retrievable in the future. Hence, after a block is dispersed, at least $f + 1$ correct servers will know the agreed-upon fingerprinted cross-checksum and at least $m$ will know their fragments. To read a block, a client retrieves these $f + 1$ matching fingerprinted cross-checksums and $m$ consistent fragments.

2.3.3 AVID-FP Pseudo-code

Pseudo-code for AVID-FP can be found in Figure 2.3.2. In order to disperse a value $B$ in AVID-FP, a client generates fragments (line 101) and the fingerprinted cross-checksum (lines 102-104). The client then sends each server its fragment and the fingerprinted cross-checksum.

Each server verifies that the fragment it receives is consistent with the fingerprinted cross-checksum (lines 600-606). If this is true, the server stores the fragment and sends an echo message containing the fingerprinted cross-checksum to all other servers (lines 607-608).

Upon receiving $m + f$ echo messages with matching fingerprinted cross-checksums from unique servers, a server can determine that at least $m$ correct servers sent such messages and hence stored fragments consistent with the fingerprinted cross-checksum (line 701). The server then sends a ready message containing the fingerprinted cross-checksum to all other servers (line 702).

If a server receives $f + 1$ ready messages with matching fingerprinted cross-checksums from unique servers (line 801), at least one must be from a correct server that determined that at least $m$ correct servers stored consistent fragments. Hence, such a server can determine likewise and send a ready message to all other servers, if it has yet to do so (line 802).

If a server receives $2f + 1$ ready messages with matching fingerprinted cross-checksums from unique servers, at least $f + 1$ must be from correct servers (line 804). Hence, each correct server will receive at least these $f + 1$ matching ready messages. Then each correct server will send a ready message (lines 801-802), so each correct server will actually receive at least $2f + 1$ matching ready messages. Thus, a correct server can conclude upon receiving $2f + 1$ ready messages that all correct servers will eventually receive $2f + 1$ ready messages, as well. The server can then save the agreed-upon fingerprinted cross-checksum and respond to the client (line 806).

Upon receiving $2f + 1$ responses, the client is assured that $f + 1$ correct servers have saved the same fingerprinted cross-checksums and that $m$ correct servers have stored fragments consistent with this fingerprinted cross-checksum. To retrieve a block, then, a client retrieves a fragment and fingerprinted cross-checksum from each server, waiting for matching fingerprinted cross-checksums
c_disperse(B): /* Client disperse protocol */
100: store_count ← 0
101: d_1, ..., d_n ← encode^δ(B)
102: for (i ∈ {1, ..., n}) do fpcc.cc[i] ← hash(d_i)
103: r ← random_oracle(fpcc.cc[1], ..., fpcc.cc[n])
104: for (i ∈ {1, ..., m}) do fpcc.fp[i] ← fingerprint(r, d_i)
105: for (i ∈ {1, ..., n}) do send(disperse, fpcc, d_i) to S_i

Upon receiving (stored) from S_i for the first time
200: store_count ← store_count + 1
201: if (store_count = 2f + 1) then return SUCCESS

c_retrieve(): /* Client retrieve protocol */
300: fpcc ← NULL; State[*] ← (NULL, NULL, NULL)
301: for (i ∈ {1, ..., n}) do send(retrieve) to S_i

Upon receiving (retrieved, fpcc', (fpcc'', d)) from S_i
400: if (fpcc' ≠ NULL) then
401:     State[i] ← (fpcc', NULL, NULL) /* Potential fpcc */
402:     if (|{j : State[j] = (fpcc', *, *)}| = f + 1) then
403:         fpcc ← fpcc' /* Found fpcc */
404:
405: if (NULL ≠ fpcc'') then
406:     h ← hash(d)
407:     r ← random_oracle(fpcc''.cc[1], ..., fpcc''.cc[n])
408:     fp ← fingerprint(r, d)
409:     fp' ← encode_i^γ(fpcc''.fp[1], ..., fpcc''.fp[m])
410:     if (fp = fp' ∧ h = fpcc''.cc[i]) then
411:         State[i] ← (fpcc', fpcc'', d) /* Consistent fragment */
412:
413: if (fpcc ≠ NULL) then
414:     Frags ← {d_j : State[j] = (*, fpcc, d_j)}
415:     if (|Frags| = m) then return decode^δ(Frags)

s_init(): /* Initialize server state */
500: echoed ← (NULL, NULL); verified ← NULL
501: EchoSet_* ← ∅; ReadySet_* ← ∅

/* Server i code to disperse data */
Upon receiving (disperse, fpcc, d_i) from client
600: h ← hash(d_i)
601: h' ← fpcc.cc[i]
602: r ← random_oracle(fpcc.cc[1], ..., fpcc.cc[n])
603: fp ← fingerprint(r, d_i)
604: fp' ← encode_i^γ(fpcc.fp[1], ..., fpcc.fp[m])
605:
606: if (echoed = (NULL, NULL) ∧ fp = fp' ∧ h = h') then
607:     echoed ← (fpcc, d_i)
608:     for (j ∈ {1, ..., n}) do send(echo, fpcc) to S_j

Upon receiving (echo, fpcc) from S_j
700: EchoSet_fpcc ← EchoSet_fpcc ∪ {j}
701: if (|EchoSet_fpcc| = m + f ∧ |ReadySet_fpcc| < f + 1) then
702:     for (j ∈ {1, ..., n}) do send(ready, fpcc) to S_j

Upon receiving (ready, fpcc) from S_j
800: ReadySet_fpcc ← ReadySet_fpcc ∪ {j}
801: if (|ReadySet_fpcc| = f + 1 ∧ |EchoSet_fpcc| < m + f) then
802:     for (j ∈ {1, ..., n}) do send(ready, fpcc) to S_j
803:
804: if (|ReadySet_fpcc| = 2f + 1) then
805:     verified ← fpcc
806:     send(stored) to client

/* Server i code to retrieve data */
Upon receiving (retrieve) from client
900: send(retrieved, verified, echoed) to client

Figure 2.3.2: AVID-FP pseudo-code



from $f + 1$ servers (lines 402-403) and consistent fragments from $m$ servers (line 415). These fragments are then decoded and the resulting block is returned.
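The server-side thresholds described above (echo, ready, and stored) can be sketched as a small state machine. This is a minimal illustration of the counting logic only, under several assumptions: the class name `Server`, the `outbox` list, and the use of any hashable value for `fpcc` are hypothetical, duplicate messages from the same sender are ignored, and the fragment verification and network layer are omitted.

```python
from collections import defaultdict

class Server:
    """Tracks echo/ready quorums for each fingerprinted cross-checksum value."""

    def __init__(self, m, f):
        self.m, self.f = m, f
        self.echo_set = defaultdict(set)   # fpcc -> ids of servers that echoed it
        self.ready_set = defaultdict(set)  # fpcc -> ids of servers that sent ready
        self.verified = None
        self.outbox = []                   # messages this server would send

    def on_echo(self, sender, fpcc):
        if sender in self.echo_set[fpcc]:
            return  # count each server once
        self.echo_set[fpcc].add(sender)
        # m + f echoes imply at least m correct servers stored consistent fragments.
        if (len(self.echo_set[fpcc]) == self.m + self.f
                and len(self.ready_set[fpcc]) < self.f + 1):
            self.outbox.append(("ready", fpcc))

    def on_ready(self, sender, fpcc):
        if sender in self.ready_set[fpcc]:
            return
        self.ready_set[fpcc].add(sender)
        # f + 1 readies include one from a correct server, so join in if not yet done.
        if (len(self.ready_set[fpcc]) == self.f + 1
                and len(self.echo_set[fpcc]) < self.m + self.f):
            self.outbox.append(("ready", fpcc))
        # 2f + 1 readies: every correct server will eventually reach this point too.
        if len(self.ready_set[fpcc]) == 2 * self.f + 1:
            self.verified = fpcc
            self.outbox.append(("stored",))
```

The two guards mirror each other so that a server sends its ready message through exactly one of the two paths, whichever threshold it reaches first.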

2.3.4 AVID-FP Correctness

To see why this is correct, recall the definition of an asynchronous verifiable information dispersal scheme given in [18]:

Definition 2.3.1. An $(m, n)$-asynchronous verifiable information dispersal scheme is a pair of protocols (disperse, retrieve) that satisfy the following with high probability:

Termination: If disperse(B) is initiated by a correct client, then disperse(B) is eventually completed by all correct servers.

Agreement: If some correct server completes disperse(B), all correct servers eventually complete disperse(B).

Availability: If $f + 1$ correct servers complete disperse(B), a correct client that initiates retrieve() eventually reconstructs some block $B'$.

Correctness: After $f + 1$ correct servers complete disperse(B), all correct clients that initiate retrieve() eventually retrieve the same block $B'$. If the client that initiated disperse(B) was correct, then $B' = B$.

Termination is simple, as in the original AVID protocol. If a correct client initiates disperse, it erasure codes the block and computes a valid fingerprinted cross-checksum before dispersing fragments to each server (lines 101-105). Eventually, at least $m + f$ correct servers receive disperse messages, verify their fragments against the fingerprinted cross-checksum, and send echo messages to all other servers (line 608). Each correct server eventually receives at least $m + f$ echo messages; it will then send a ready message (line 702) unless it has already done so (line 802). Thus each correct server will eventually receive at least $2f + 1$ ready messages, at which point it will send a stored message to the client and complete. Hence, all correct servers eventually complete.

Agreement is simpler than in the original AVID protocol because a server in AVID-FP need not reconstruct the block before returning a ready message. If some correct server completes disperse(B), then it received $2f + 1$ ready messages (line 804). At least $f + 1$ must have come from correct servers, so all correct servers will eventually receive ready messages from these servers. Then the condition on either line 801 or line 701 will be met for all correct servers, so all correct servers will send ready messages and receive at least $2f + 1$ such messages, thus completing.

Availability is different than in the original AVID protocol. In AVID, fragments must be echoed such that a correct server can reconstruct its fragment if needed; in AVID-FP, fragments are not echoed. If any correct server completes disperse, it received $2f + 1$ ready messages. Then at least one correct server received $m + f$ echo messages. If not, at most $f$ ready messages would be received by any correct server, because no correct server would meet the condition on line 701. Hence, at least $m$ correct servers stored consistent fragments (line 607). Then after $f + 1$ correct servers complete disperse, a client that initiates retrieve will eventually receive $f + 1$ matching fingerprinted cross-checksums (saved on line 805) along with $m$ consistent fragments, which it will decode and return as some block $B'$.

Correctness is similar to the original AVID protocol except that the properties of the homomorphic fingerprint are required. Suppose some correct server saves $\mathrm{fpcc}_1$ on line 805 and some other correct server saves $\mathrm{fpcc}_2 \ne \mathrm{fpcc}_1$. Then $m + f$ servers echoed $\mathrm{fpcc}_1$, of which at least $m$ were correct, and $m + f$ servers echoed $\mathrm{fpcc}_2$, of which at least $m$ were correct. Because a correct server will only echo once (line 606 will never be satisfied after line 607 is reached), there are at least $m + m + f$ servers involved, which is a contradiction (there are only $n < m + m + f$ servers in the system). Hence, any block decoded during retrieve is consistent with the same fpcc. Furthermore, if a correct client initiated disperse(B), this fpcc will be consistent with $B$. Then, by Theorem 2.2.4, the probability that $B \ne B'$ is negligible, for appropriately chosen parameters.
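The counting step in the Correctness argument can also be written as a quorum-intersection inequality. The derivation below is a sketch that uses the section's parameters $n = m + 2f$ and $m \ge f + 1$; the set names $Q_1$ and $Q_2$ (the servers that echoed $\mathrm{fpcc}_1$ and $\mathrm{fpcc}_2$, respectively) are introduced here only for illustration.

```latex
\begin{align*}
|Q_1 \cap Q_2| &\ge |Q_1| + |Q_2| - n \\
               &= (m + f) + (m + f) - (m + 2f) \\
               &= m \\
               &\ge f + 1 .
\end{align*}
```

Since at most $f$ servers are faulty, the intersection contains at least one correct server, which would have had to echo two different fingerprinted cross-checksums. That contradicts the guard on line 606, so two distinct values cannot both gather $m + f$ echoes.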


2.4  Performance 

Homomorphic fingerprinting is efficient, contributing little overhead to distributed protocols. To demonstrate that homomorphic fingerprinting is not a substantial computational burden in protocols such as the AVID-FP protocol given above, this section compares an implementation of the evaluation fingerprinting function against cryptographic hashing. The evaluation fingerprinting function implementation in this section is similar to the evaluation hash considered in [79] and [94].

A polynomial

\[ d(y, x) = a_{\alpha}(x) \cdot y^{\alpha} + \cdots + a_0(x) \cdot y^{0} \in \mathbb{E}_{q^{k\gamma}}[y] \]

can be evaluated using Horner's rule. To do so, let $fp \leftarrow 0$, and for $j = \alpha, \ldots, 0$, iteratively compute $fp \leftarrow fp \cdot y + a_j(x)$. The efficiency of this implementation then depends on an efficient implementation of "+" and "·" for $a_j(x), y \in \mathbb{F}_{q^k}[x]/p(x)$, where $y$, the point at which to evaluate, is the fixed random value $s(x) \leftarrow S(r)$.

Given an implementation of "+" and "·" for $\mathbb{F}_{q^k}$, construct "+" and "·" for $\mathbb{F}_{q^k}[x]/p(x)$ as follows. Consider the representation of $a(x) \in \mathbb{F}_{q^k}[x]/p(x)$ as a polynomial

\[ a(x) = b_{\gamma-1} x^{\gamma-1} + \cdots + b_0 \]

where $b_i \in \mathbb{F}_{q^k}$. The "+" operator is defined as the addition of same-degree terms. The "·" operator is defined as multiplication of two polynomials of degree less than $\gamma$ modulo a constant monic degree-$\gamma$ irreducible polynomial $p(x) \in \mathbb{F}_{q^k}[x]$.

For fixed $s(x)$, compute $a(x) \cdot s(x)$ as follows. For $0 \le i < \gamma$, build $\gamma$ lookup
tables mapping each $b_i \in \mathbb{F}_{q^k}$ to $b_i \cdot x^i \cdot s(x) \bmod p(x)$; that is,
compute the map $b_i \mapsto b_i \cdot x^i \cdot s(x) \bmod p(x)$. Each of these $\gamma$ tables will
contain entries that are each $\lceil \log_2 q^{k\gamma} \rceil$ bits wide. A 128-bit fingerprint over
$\mathbb{F}_{2^8}$ must compute 16 such tables after the random value $r$ is selected; each table is
4 kB, for a total of 64 kB. Given these tables, one can compute $a(x) \cdot s(x)$ as the sum of
$\gamma$ lookups, $\sum_i (b_i \cdot x^i \cdot s(x) \bmod p(x))$. For $\mathbb{F}_{2^8}$, this requires
a table lookup plus an exclusive-or per byte of input.
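
The table construction can be illustrated with the following sketch. One substitution should be noted plainly: rather than the tower representation described above (polynomials of degree less than 16 with byte coefficients), the sketch represents the 128-bit field as binary polynomials modulo x^128 + x^7 + x^2 + x + 1, which is an assumed modulus chosen only because it is a well-known irreducible polynomial. The per-byte table layout is the same in both representations (16 tables of 256 entries, each 16 bytes wide), and the technique relies only on the linearity of multiplication by a fixed element. All function names are hypothetical.

```python
GAMMA = 16                       # bytes per field element (128-bit fingerprint)
P = (1 << 128) | 0x87            # x^128 + x^7 + x^2 + x + 1 (assumed modulus)


def gf_mul(a: int, b: int) -> int:
    """Direct shift-and-add multiplication in GF(2)[x]/P(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= P               # reduce whenever the degree reaches 128
    return result


def build_tables(s: int) -> list:
    """tables[i][b] = (b * x^(8i)) * s mod P, for every byte value b."""
    return [[gf_mul(b << (8 * i), s) for b in range(256)] for i in range(GAMMA)]


def mul_by_s(a: int, tables: list) -> int:
    """Multiply a by the fixed element s: one lookup and one XOR per byte."""
    out = 0
    for i in range(GAMMA):
        out ^= tables[i][(a >> (8 * i)) & 0xFF]
    return out


def fingerprint(blocks: list, tables: list) -> int:
    """Horner's rule over 128-bit coefficients, highest degree first."""
    fp = 0
    for a_j in blocks:
        fp = mul_by_s(fp, tables) ^ a_j
    return fp
```

The tables depend on the evaluation point s, so they are rebuilt whenever a new random value is selected; afterwards each 16-byte coefficient costs 16 lookups and 16 exclusive-ors, which matches the per-byte cost stated above.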

For $\mathbb{F}_{2^8}$, building these tables is efficient: “+” is simply exclusive-or, and “·” can be
implemented using a 64 kB lookup table. The “mod” operator can be defined using “+” and “·”. Because
$p(x)$ is constant, “mod” can be implemented with a lookup table for
$b_i \mapsto b_i \cdot x^\gamma \bmod p(x)$. This table will contain $2^8$ entries of $\gamma$ bytes
each, for a total of 4 kB for a 128-bit fingerprint, and it can be computed before the random value
$r$ is selected.

Gladman’s implementation of SHA-1 [42] achieves a throughput of 110 megabytes per second
on a 3 GHz Intel Pentium D. On this machine, the time to compute lookup tables for the evaluation
fingerprint implementation presented here is 20 microseconds. After this computation, this
implementation achieves a throughput of 410 megabytes per second.


2.5  Other  Protocols 

m-of-n erasure coding is used in many distributed systems (e.g., [5, 19, 22, 44, 62, 91]), because it
reduces storage, network bandwidth, and I/O bandwidth. The savings approach a factor of m when
compared to replication. The division and evaluation fingerprinting functions are homomorphic over
several popular erasure codes. Reed-Solomon codes [85] interpolate a polynomial over a field
$\mathbb{F}_{q^k}$, and Rabin’s Information Dispersal Algorithm [84] encodes using an n × m matrix
over a field $\mathbb{F}_{q^k}$ where every m × m submatrix is invertible. Both are linear erasure
codes over $\mathbb{F}_{q^k}$. A common field is $\mathbb{F}_{2^8}$ such that field elements are
bytes. Rabin fingerprinting is homomorphic over many erasure codes based solely on exclusive-or,
such as Online Codes [74] and parity.
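
To make the homomorphic property concrete for a linear code over GF(2^8), the sketch below computes one encoded fragment as a linear combination of data fragments and checks that its evaluation fingerprint equals the same linear combination of the data fragments' fingerprints. It is not a full Reed-Solomon or IDA implementation: the coefficient row is a hypothetical row of an encoding matrix, the reduction polynomial 0x11B is assumed, and all names are illustrative.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8), reducing by 0x11B (assumed field polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def fingerprint(data: bytes, y: int) -> int:
    """Evaluation fingerprint of data at point y, computed by Horner's rule."""
    fp = 0
    for byte in data:
        fp = gf256_mul(fp, y) ^ byte
    return fp


def encode_fragment(row: list, data_fragments: list) -> bytes:
    """One encoded fragment: sum_i row[i] * data_fragments[i], bytewise."""
    out = bytearray(len(data_fragments[0]))
    for c, frag in zip(row, data_fragments):
        for j, byte in enumerate(frag):
            out[j] ^= gf256_mul(c, byte)
    return bytes(out)


def encode_fingerprints(row: list, fps: list) -> int:
    """Apply the same linear combination to the fingerprints themselves."""
    acc = 0
    for c, fp in zip(row, fps):
        acc ^= gf256_mul(c, fp)
    return acc
```

With row = [1, 1, 1] the encoding reduces to exclusive-or parity, so the same check covers the parity case mentioned above.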

Homomorphic fingerprinting provides benefits to erasure-coded Byzantine fault-tolerant storage
systems [19, 44]. Section 2.3 demonstrated how the AVID protocol [18], used in [19], can
exploit homomorphic fingerprinting to be more bandwidth efficient. Variants of the PASIS
protocol [44, 45] can also exploit homomorphic fingerprinting. In the “non-repairable” protocol, a
writer sends fragments along with a cross-checksum to each server; a reader returns a block after
finding sufficient servers with fragments and matching cross-checksums. Before accepting a value, a
reader must reconstruct all fragments and recompute the cross-checksum, a significant computational
overhead. This protocol can benefit directly by replacing the cross-checksum with a fingerprinted
cross-checksum, obviating the need for fragment reconstruction and cross-checksum recomputation. The
“repairable” protocol can also benefit, but requires further modifications.

Homomorphic  fingerprinting  may  also  provide  benefits  to  erasure-coded  broadcast  [22],  content 
distribution  [61],  and  similar  applications,  if  the  encoding  is  not  trusted  to  be  consistent  without 
verification. 


2.6  Related  Work 

A common cryptographic application of universal hashing is for message authentication codes
(MACs) [79]. An early proposal by Krawczyk [59] included a MAC similar to Rabin’s fingerprints.
Shoup presented faster variants [94] along with implementation suggestions to optimize
performance. Nevelsteen compares several other variants [79].

Homomorphic  fingerprinting  functions  share  homomorphic  properties  with  incremental  hash¬ 
ing  functions  [11].  Incremental  hashing,  however,  is  substantially  slower  because  it  is  based  on 
number-theoretic  primitives.  The  homomorphic  properties  of  incremental  hashing  are  exploited 
in  [61],  which  applies  these  homomorphic  properties  to  Online  Codes  [74]  in  a  peer-to-peer  content 
distribution  network. 

The algebraic properties of certain universal hashes have been examined before. Rabin used
these  properties  to  update  the  fingerprint  of  a  file  [83].  In  [93],  a  similar  technique  is  used  by  a  disk 
scrubber  to  check  the  consistency  of  erasure-coded  data  in  a  benign  environment.  In  [15],  algebraic 
properties  are  leveraged  to  permit  fast  updates  of  Rabin  fingerprints  of  data  structures  such  as  trees. 

More  distantly  related  to  this  technique  is  verifiable  secret  sharing  (e.g.,  [29,  39,  80,  98]),  which 
allows  correct  participants  to  verify  that  a  secret  was  shared  among  them  consistently.  The  secrecy 
of  the  shared  value,  however,  which  must  be  preserved  throughout  the  share  distribution  and  verifi¬ 
cation  process,  drives  these  protocols  to  employ  number-theoretic  techniques  that  are  significantly 
heavier-weight  than  considered  here. 

It  is  worth  mentioning  that  a  random  oracle,  as  in  Section  2.2,  can  be  replaced  with  an  evaluation 
of  a  distributed  pseudo-random  function  [78]  in  a  protocol  such  as  AviD-FP.  This  construction  has 
the  benefit  of  requiring  only  standard  cryptographic  assumptions. 

2.7  Conclusion 

Homomorphic  fingerprinting  enables  efficient  verification  that  fragments  have  been  correctly  gen¬ 
erated  by  an  erasure-coding  of  a  particular  data  block.  A  high  level  of  security  can  be  achieved 
with  small  fingerprints,  and  fingerprint  generation  has  lower  computational  overhead  than  crypto¬ 
graphic  hashing.  This  technique  provides  benefits  to  several  distributed  protocols.  In  particular, 
distributed  storage  systems  capable  of  tolerating  Byzantine  clients,  which  may  attempt  to  write  sets 
of  fragments  that  reconstruct  different  values  depending  upon  which  subset  is  used,  can  benefit 
significantly  from  this  mechanism. 


Chapter  3 

The  Correctness  of  Distributed  Systems 
in  the  Presence  of  Faulty  Clients 


As  the  complexity  and  significance  of  distributed  systems  increases,  ensuring  that  such  systems  op¬ 
erate  as  expected  becomes  both  more  difficult  and  more  important.  Thus,  proposals  for  systems  that 
depend  on  complex  distributed  protocols  are  often  accompanied  by  arguments  that  such  protocols 
remain  live  under  appropriate  conditions  and  implement  the  expected  operational  semantics.  Such 
arguments  are  called  liveness  and  safety  arguments.  There  are  several  properties  that  a  client-server 
protocol  may  ensure  to  demonstrate  liveness  (e.g.,  wait-freedom  [49]  or  obstruction-freedom  [50]). 
When  the  protocol  model  is  a  set  of  servers  providing  clients  access  to  a  concurrent  object,  safety 
is often described in terms of linearizability [51], which allows a concurrent object to be reasoned
about  in  terms  of  its  sequential  specification. 

As  distributed  systems  grow  in  size  and  importance,  they  must  also  tolerate  faults  other  than 
crashes.  Unfortunately,  safety  conditions  such  as  linearizability  do  not  apply  when  clients  may 
disobey  the  protocol  specification  because  predicates  needed  to  demonstrate  such  conditions  may 
not be well defined. Linearizability relates potential real-time precedence to the invocations and
responses of operations as issued and seen by clients, but a faulty client may not properly invoke
an operation. Though a protocol need not provide guarantees about responses provided to faulty
clients, it must account for the effect of invocations by faulty clients. After all, a faulty client can
always invoke an operation, if only by following the protocol correctly.

This chapter considers the correctness of distributed systems that tolerate Byzantine-faulty
clients, which may exhibit arbitrary or even malicious behavior, and require a safety condition
similar to linearizability. Byzantine fault tolerance [65] has become increasingly important in the
distributed systems community. Safety conditions similar to linearizability define the correct
behavior of Byzantine fault-tolerant systems such as replicated state machines [2, 24, 32, 55] and
storage systems [19, 44, 48, 66, 68].

Of course, definitions of safety in the presence of faulty clients have been previously
considered; after all, one must define correctness before arguing that a protocol is correct. Two
common approaches are to extend linearizability to apply in the presence of faulty clients and to
define a new safety condition, possibly arguing its similarity to linearizability. Section 3.3.2
considers prior extensions to linearizability, but prior extensions may not ensure the expected
operational semantics for some applications, and prior extensions may preclude some protocols that
provide operational semantics that suffice for many applications. Defining a new safety condition is
even worse, as it obfuscates the operational semantics of a protocol. A non-standard safety condition
complicates protocol comparison, obscures which faults are tolerated, and may fail to ensure the
expected operational semantics.

This chapter proposes a minimal correctness condition that is restrictive enough to prevent
anomalies but permissive enough to allow all prior protocols. For applications that require stricter
semantics, this chapter provides a mechanism to strengthen this condition. By separating basic
correctness requirements from additional protocol features, this chapter strives to facilitate
comparison of protocol guarantees. Furthermore, explicit delineation of protocol features ensures
that a protocol provides adequate operational semantics when deployed in a real system without
unduly restricting the design of the protocol.

The rest of this chapter is organized as follows. Section 3.2 presents a formal definition of
linearizability in the presence of faulty clients that does not preclude previous protocols that appear
correct to correct clients. Section 3.3 provides techniques to ensure stricter operational semantics
as needed, but notes that such semantics are not always possible. In particular, Section 3.3.2 defines
immediate recovery, which ensures a faulty client cannot affect a block storage or atomic register
protocol after the client has been revoked. Section 3.3.2 then proves that entirely wait-free protocols
cannot provide immediate recovery. In contrast to Section 3.3.2, Section 3.4 presents the first
block storage protocol that provides wait-free reads and writes and ensures immediate recovery. By
necessity, revocation is not wait-free.


3.1  Background 

This chapter is concerned with distributed systems in which a set of clients perform concurrent
operations on objects controlled by a set of servers. The standard definition of safety for such
distributed systems is linearizability as defined by Herlihy and Wing [51], which defines safety from
the perspective of correct clients. For reference, this chapter restates the definition of linearizability
in Section 3.2 as Definition 3.2.2. Proofs of safety ensure that a protocol adheres to a particular
specification. This chapter argues that previous attempts to define safety conflate safety with
additional protocol features. Additional features may be necessary for a particular application, but are
not required to ensure safety in the traditional sense. Section 3.3 describes features that may be
useful but go beyond safety.

When might traditional notions of safety be insufficient? Consider an accounting firm with
a shared distributed storage system in an asynchronous environment. As is necessary in many
practical storage protocols, the read subprotocol includes a write back step, in which the client
writes the block read back into the system to ensure subsequent reads are up-to-date [28]. A faulty
client, for example a computer operated by a malicious employee who is subsequently fired, may
partially complete a write, such that a subsequent read operation may or may not return the value
written. (For example, PASIS would classify the read as either repairable or incomplete [44].) For
many days, weeks, or years, reading the block may not return the partially-written value. But, at
some distant point in the future, a different subset of servers may cause a client to read and hence
write back the partial write, wreaking havoc. For example, the faulty client could overwrite all
blocks with zeroes, with the consequence only discovered long after the client has left the system.

The Byzantine fault-tolerant storage community has struggled with this problem for several
years. Malkhi, Reiter, and Lynch [70] extended the concept of linearizability to distributed
storage systems with Byzantine-faulty clients. Their proposal requires that the number of operations
issued by faulty clients is finite if all faulty clients eventually leave the system. A notable
drawback of this approach is that it only applies to infinite executions of a system: all clients
issue finitely many operations in finite executions, and system executions are finite in practice.
Liskov and Rodrigues [66] defined two safety conditions for Byzantine fault-tolerant storage,
BFT-linearizable and BFT-linearizable+. Unfortunately, several reasonable storage protocols meet
neither safety condition, and neither condition precludes certain anomalies, such as a write from a
faulty client taking effect years after the client was removed from the system. The benefits and
drawbacks of these extensions are considered in Section 3.3.2. Aguilera and Swaminathan described a
property called limited effect, which is similar in intention to immediate recovery, but they achieve
their property by implementing a weaker primitive (an abortable register) to allow wait-freedom [6].

There have been several recent protocols that tolerate faulty clients in some form. Castro and
Liskov [24], Abd-El-Malek et al. [2], Cowling et al. [32], and Kotla et al. [55] all consider
Byzantine-faulty clients in replicated state machines. Malkhi and Reiter [68], Goodson et al. [44],
Liskov and Rodrigues [66], Cachin and Tessaro [19], and Hendricks et al. [48] all consider
Byzantine-faulty clients in storage systems. Because there is no agreed upon definition of
correctness in the presence of faulty clients, protocol designers often create their own definition.
Choosing the right definition of correctness in such an ad-hoc manner is difficult if not perilous,
and one could easily imagine
the  process  going  awry.  To  state  the  problem  another  way,  using  formal  methods  to  prove  ad-hoc 
criteria  burdens  the  protocol  designer  without  ensuring  traditional  notions  of  correctness. 


3.2  Safety  in  the  Presence  of  Faulty  Clients 

This  section  proposes  the  minimal  requirements  for  safety.  Informally,  for  a  protocol  to  be  correct 
in  the  presence  of  faulty  clients,  from  the  perspective  of  the  correct  clients,  any  execution  of  the 
protocol  could  have  happened  if  all  clients  were  correct.  Before  restating  this  more  formally,  define 
a few terms following the terminology of Herlihy and Wing [51, Section 2.1]. A client (or process)
issues operations in sequence. An operation consists of two events: an invocation and a matching
response. The execution of a distributed system is modeled by a history, which is an ordered
sequence of invocation and response events by clients. Each event is associated with the client that
invoked the operation. A protocol is said to be correct if any execution of the protocol will result in
a history that satisfies the correctness condition. Consider the following definition of an extension
that accounts for faulty clients to a correctness condition that does not:

Definition 3.2.1. A history H_c comprised of events from all correct clients satisfies a faulty-client
extension to a correctness condition if and only if there exists a history H such that

1. H satisfies the correctness condition and

2. The subsequence of H consisting of each event from every correct client is equal to H_c.

This definition satisfies two important properties. First, in the absence of faulty clients, a
protocol will satisfy the traditional notion of correctness. This is because if all clients are correct,
H is identical to H_c and thus H_c satisfies the correctness condition. Thus, an extension to linearizability
will  imply  linearizability  in  the  absence  of  faulty  clients.  Second,  this  definition  prevents  anomalies. 
For  example,  a  state  machine  protocol  cannot  transition  to  an  unreachable  state  from  the  perspective 
of  correct  clients.  An  execution  of  a  protocol  is  only  correct  if,  from  the  perspective  of  each  correct 
client,  the  execution  could  have  resulted  from  a  set  of  clients  correctly  executing  the  protocol.  In 
other  words,  any  effects  of  faulty  clients  can  be  explained  away  as  though  such  faulty  clients  were 
correct. 
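
The structure of Definition 3.2.1 can be sketched as a check against a candidate witness. The sketch assumes a simple event representation and takes the underlying correctness condition as a caller-supplied predicate; it verifies one proposed full history rather than searching for one, whereas the definition only requires that some such history exists. The names Event, project, and witnesses_extension are hypothetical.

```python
from typing import Callable, List, NamedTuple, Set


class Event(NamedTuple):
    client: str   # client associated with the event
    kind: str     # "inv" or "resp"
    op: str       # e.g. "write(1)" or "read -> 1"


def project(history: List[Event], correct: Set[str]) -> List[Event]:
    """Subsequence of history consisting of events from correct clients."""
    return [e for e in history if e.client in correct]


def witnesses_extension(observed: List[Event], candidate: List[Event],
                        correct: Set[str],
                        condition: Callable[[List[Event]], bool]) -> bool:
    """True if candidate is a valid witness for the correct-client history.

    observed  -- the history of events from correct clients only
    candidate -- a proposed full history that may add faulty-client events
    condition -- the underlying correctness condition (e.g. linearizability)
    """
    return condition(candidate) and project(candidate, correct) == observed
```

If every client is correct, the projection is the identity, so the only possible witness is the observed history itself and the check reduces to the underlying condition.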

For a particular correctness condition such as linearizability in the presence of Byzantine-faulty
clients, Definition 3.2.1 can be specialized as a Byzantine-faulty-client extension of linearizability.*
Linearizability  relates  the  sequential  specification  of  a  protocol  to  the  correctness  of  its  distributed 
variant.  The  benefit  of  linearizability  is  that  the  sequential  specification  of  a  protocol  is  often  simpler 
than  its  distributed  equivalent  because  operations  may  not  be  totally  ordered  in  a  distributed  system; 
for  example,  the  sequential  specification  of  a  storage  register  is  that  a  read  returns  the  current  value 
and  a  write  sets  the  value  to  the  argument  of  the  write. 

Before  continuing,  define  a  few  more  terms  following  Herlihy  and  Wing  [51].  Two  histories 
are  equivalent  if,  for  each  client,  the  ordered  subsequences  of  events  associated  with  that  client 
from  either  history  are  the  same.  A  history  is  sequential  if  the  first  event  is  an  invocation;  every 
invocation,  except  possibly  the  last,  is  followed  by  a  matching  response;  and  every  response  is 
preceded  by  a  matching  invocation.  A  legal  sequential  history  is  a  sequential  history  that  conforms 
to  the  sequential  specification.  A  history  H  can  be  extended  by  including  a  matching  response  for 
some  of  its  unmatched  invocations,  and  it  can  be  completed  by  omitting  unmatched  invocations. 
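
The terms just defined translate directly into small predicates over event lists. The sketch below assumes a single shared object, so that a response matches an invocation when both belong to the same client, and assumes well-formed histories in which each client alternates invocations and responses. The representation and function names are illustrative only.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    client: str
    kind: str       # "inv" or "resp"
    op: str = ""    # optional description of the operation


def equivalent(h1: List[Event], h2: List[Event]) -> bool:
    """Per-client subsequences of the two histories are identical."""
    clients = {e.client for e in h1} | {e.client for e in h2}
    return all([e for e in h1 if e.client == c] == [e for e in h2 if e.client == c]
               for c in clients)


def is_sequential(history: List[Event]) -> bool:
    """Invocations and matching responses strictly alternate, starting with
    an invocation; only the final invocation may lack a response."""
    for i, e in enumerate(history):
        if i % 2 == 0:
            if e.kind != "inv":
                return False
        elif e.kind != "resp" or e.client != history[i - 1].client:
            return False
    return True


def complete(history: List[Event]) -> List[Event]:
    """Omit invocations that have no matching response."""
    pending = {}                       # client -> index of unmatched invocation
    for i, e in enumerate(history):
        if e.kind == "inv":
            pending[e.client] = i
        else:
            pending.pop(e.client, None)
    drop = set(pending.values())
    return [e for i, e in enumerate(history) if i not in drop]
```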
Recall  the  definition  of  a  linearizable  history  [51,  Section  2.2]: 

Definition  3.2.2.  A  history  H  is  linearizable  if 

1. H can be extended to some history H' such that complete(H') is equivalent to some legal
sequential  history  S  and 

2.  for  each  response  that  precedes  some  invocation  in  H,  the  response  also  precedes  that  invoca¬ 
tion  in  S. 
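
For small histories, Definition 3.2.2 can be checked by brute force. The sketch below makes several simplifying assumptions: the object is a single read/write register, the history is already complete (no pending invocations), and each operation is summarized by the positions of its invocation and response events in the history. It tries every ordering, so it is suitable only for illustration. All names are hypothetical.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    client: str
    kind: str        # "read" or "write"
    value: object    # value written, or value returned by the read
    inv: int         # position of the invocation event in the history
    resp: int        # position of the matching response event


def is_legal_register_history(seq, initial=None) -> bool:
    """Sequential specification: a read returns the most recent write."""
    current = initial
    for op in seq:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def respects_real_time(seq) -> bool:
    """If a response precedes an invocation in the history, the same
    order must hold in the sequential history."""
    for i, later in enumerate(seq):
        for earlier in seq[:i]:
            if later.resp < earlier.inv:
                return False
    return True


def is_linearizable(ops, initial=None) -> bool:
    return any(respects_real_time(p) and is_legal_register_history(p, initial)
               for p in permutations(ops))
```

Because a client issues operations in sequence, the real-time check also preserves each client's own order, which is what equivalence to the sequential history requires.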

A  Byzantine  extension  to  linearizability  is  then  defined  as  follows: 

Definition 3.2.3. A history H_c comprised of events from all correct clients satisfies a
Byzantine-faulty-client extension to linearizability if there exists some history H such that

1. H can be extended to some history H' such that complete(H') is equivalent to some legal
sequential history S,

2. For each response that precedes some invocation in H_c, the response also precedes that
invocation in S, and

3. The subsequence of H consisting of each event from every correct client is equal to H_c.

Definition 3.2.3 is a direct application of Definition 3.2.1 to Definition 3.2.2, making Definition 3.2.3
the natural extension of linearizability in the presence of faulty clients. The first and second terms
restate Definition 3.2.2. The third term corresponds to Definition 3.2.1. The second term is simplified
by using the correct-client history H_c rather than the full history H. The second term can use H_c
rather than H because any H that satisfies the first and third terms would still satisfy these terms if
reordered such that responses to faulty clients do not precede any invocations and invocations by
faulty clients precede all responses. Thus, invocations and responses from faulty clients need not
apply to the second term, so H_c is equivalent to H for this purpose.

*Or a Byzantine extension of linearizability, or just Byzantine linearizability, for short.

3.3  Stricter  Extensions  of  Linearizability 

A  protocol  that  tolerates  faulty  clients  must  ensure  that  all  protocol  executions  satisfy  Defini¬ 
tion  3.2.3,  but  Definition  3.2.3  may  not  ensure  sufficient  operational  semantics  for  some  applica¬ 
tions.  This  section  considers  additional  protocol  features  that  restrict  the  ability  of  faulty  clients  to 
invoke  operations.  Protocol  designers  should  describe  such  features  when  specifying  what  seman¬ 
tics  a  protocol  implementation  provides.  This  section  considers  invocation  criteria  and  recovery. 
Most  protocols  should  at  least  consider  invocation  criteria,  which  describes  when  an  invocation 
from  a  faulty  (or  correct)  client  is  allowed  to  appear  in  a  history. 

3.3.1  Invocation  Criteria 

Recall  that  history  H  in  Definition  3.2.3  includes  events  associated  with  both  correct  and  faulty 
clients.  A  faulty-client  invocation  criteria  defines  the  conditions  under  which  invocations  from 
faulty clients may be present in H. By including invocation criteria when describing protocol
semantics,  the  effects  of  faulty  clients  can  be  made  obvious  and  application  designers  can  determine 
if  a  protocol  provides  acceptable  semantics. 

Definition  3.3.1.  The  faulty-client  invocation  criteria  is  a  set  of  requirements  that  must  be  satis¬ 
fied  for  each  invocation  associated  with  a  faulty  client. 

A  useful  invocation  criteria  is  that  a  faulty  client  must  successfully  complete  a  remote  procedure 
call  at  a  correct  server  for  each  invocation  in  H  associated  with  that  client.  This  ensures  that  a 
faulty  client  must  perform  a  particular  action  in  order  to  affect  the  state  of  the  system.  Specifying 
invocation  criteria  is  a  useful  sanity  check  in  the  design  of  storage  protocols,  and  allows  application 
designers  to  choose  an  appropriate  protocol. 

3.3.2  Recovery 

The  notion  of  recovery  from  faulty  clients  has  been  proposed  as  a  desirable  or  necessary  property 
of  storage  protocols  that  tolerate  Byzantine-faulty  clients.  Recovery  can  also  apply  to  non-storage 
protocols.  Recovery  is  usually  described  in  relation  to  a  special  Stop  event  in  an  execution  history. 
After  a  Stop  event  is  issued  for  a  client,  the  client  can  issue  only  a  limited  number  of  operations. 
In  practice,  a  Stop  event  is  issued  when  a  faulty  client  is  repaired  or  removed  from  the  system, 
perhaps  after  being  detected  as  faulty.  The  least  restrictive  variant  of  recovery  is  eventual  recovery. 

Definition  3.3.2.  A  distributed  protocol  eventually  recovers  from  faulty  clients  if  each  faulty 
client  issues  only  a  finite  number  of  operations  after  a  Stop  event  is  issued  for  all  faulty  clients. 




Eventual recovery was proposed by Malkhi et al. [70] as “Byznearizability.” It ensures that
faulty clients cannot affect the state of the system after all faulty clients are removed and enough
time has passed. Unfortunately, eventual recovery provides few guarantees in practice: any finite
history trivially satisfies eventual recovery, and all histories encountered in practice are, of course,
finite.

Despite this criticism, demonstrating that a protocol eventually recovers is useful because
protocols that do not eventually recover exhibit strange semantics: even after removal from the system,
faulty clients continuously change the state of the system. Eventual recovery is often implied by
a reasonable faulty-client invocation criteria. For example, if a client must communicate with a
correct server to invoke an operation, the protocol will eventually recover from faulty clients. A
slightly more restrictive property is bounded recovery:

Definition 3.3.3. A distributed protocol recovers after bounded operations if for some finite
constant b, after a Stop event is issued for a faulty client the client issues at most b operations.

Bounded recovery was proposed by Liskov and Rodrigues [66] as “BFT-linearizability.” Liskov
and Rodrigues further refined the definition to ensure that operations cannot be issued by a faulty
client after a Stop event is issued for the client and the state of the system is overwritten k
times. They call this stricter definition “BFT-linearizability+.” Storage protocols that satisfy
BFT-linearizability or BFT-linearizability+ bound the number of “lurking writes” [66], which are
writes by a client that occur after the client has left the system.

Unfortunately, bounded recovery may be too restrictive: most recent storage protocols do not
ensure bounded recovery, with no apparent ill consequences [19, 44, 48, 68]. Augmenting a storage
protocol to ensure bounded recovery may be a sizable undertaking. Furthermore, for large values of
b, faulty clients can issue many operations after a Stop event, as in eventual recovery, perhaps too
many. Yet, the benefits of even small b for b > 0 are not clear; in the hypothetical accounting firm
of Section 3.1, bounded recovery ensures only that data is lost a bounded number of times, which is
b times too many. Thus, bounded recovery may be too restrictive but also not restrictive enough in
practice.

The  most  restrictive  variant  of  recovery  is  immediate  recovery: 

Definition  3.3.4.  A  distributed  protocol  immediately  recovers  from  faulty  clients  if  after  a  Stop 
event  is  issued  for  a  client  the  client  issues  no  more  operations. 

Immediate  recovery  is  useful  in  practice  because  it  allows  for  the  standard  access  control 
semantics — once  a  client’s  access  has  been  revoked,  the  client  cannot  issue  any  more  operations. 
Revoking  the  access  rights  of  a  client  can  be  modeled  as  issuing  a  Stop  event  for  that  client.  Un¬ 
fortunately,  immediate  recovery  may  restrict  the  liveness  properties  of  a  protocol. 
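
The bounded and immediate variants of recovery can be phrased as simple checks over a finite history that includes Stop events. The sketch below assumes a history given as (client, kind) pairs, with kind one of "inv", "resp", or "stop", and counts an operation by its invocation. Eventual recovery is omitted because, as noted above, every finite history satisfies it trivially. The representation and names are hypothetical.

```python
def operations_after_stop(history):
    """Return {client: number of invocations after that client's Stop event}."""
    stopped = set()
    counts = {}
    for client, kind in history:
        if kind == "stop":
            stopped.add(client)
            counts.setdefault(client, 0)
        elif kind == "inv" and client in stopped:
            counts[client] += 1
    return counts


def recovers_after_bounded(history, b: int) -> bool:
    """Bounded recovery: at most b operations per client after its Stop event."""
    return all(n <= b for n in operations_after_stop(history).values())


def recovers_immediately(history) -> bool:
    """Immediate recovery is the special case b = 0."""
    return recovers_after_bounded(history, 0)
```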

Theorem  3.3.5.  Suppose  a  protocol  using  n  processes  implements  an  object  in  an  asynchronous 
environment  that  tolerates  at  least  one  crash-fault  with  operations  read,  write,  and  revoke.  Read 
returns  the  current  value,  write  updates  the  value,  and  revoke  removes  write  permission  for  a  specific 
process.  Then  this  protocol  is  not  wait-free  [49]. 

Proof. Let each process begin with write access. Each process will attempt to write its proposed
value to the object, revoke write access to every other process, and then read the current value of
the object. Because reading comes after revoking, and because immediate recovery ensures that no
writes complete after revocation, this algorithm implements n-process consensus, where the value
read is the value agreed upon. Thus, by the FLP impossibility result [40], the protocol cannot always
be live. □
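
The reduction in the proof can be exercised with a small simulation. The sketch below assumes an idealized sequential object whose operations take effect atomically and whose revoke takes effect immediately, and it interleaves the steps of the processes at random. It models neither crashes nor message delays, so it illustrates only the agreement argument, not the impossibility result itself. All names are hypothetical.

```python
import random


class RevocableRegister:
    """Sequential model of an object with read, write, and revoke in which
    a revoked process's writes never take effect."""

    def __init__(self, processes):
        self.value = None
        self.can_write = set(processes)

    def write(self, pid, value):
        if pid in self.can_write:
            self.value = value

    def revoke(self, pid):
        self.can_write.discard(pid)

    def read(self):
        return self.value


def steps_for(pid, proposal, processes, obj, decided):
    """Generator that yields after each atomic operation of process pid."""
    obj.write(pid, proposal)
    yield
    for other in processes:
        if other != pid:
            obj.revoke(other)
            yield
    decided[pid] = obj.read()
    yield


def run(proposals, rng):
    """Interleave all processes at random; return the value each one reads."""
    processes = list(proposals)
    obj = RevocableRegister(processes)
    decided = {}
    runnable = [steps_for(p, proposals[p], processes, obj, decided)
                for p in processes]
    while runnable:
        g = rng.choice(runnable)
        try:
            next(g)
        except StopIteration:
            runnable.remove(g)
    return decided
```

Every read occurs after the reader has written and has revoked all other processes, so no write can take effect after the first read; all processes therefore read the same proposed value regardless of the interleaving.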

Though Theorem 3.3.5 precludes distributed objects with wait-free read, write, and revoke
operations that tolerate even a single crash, Section 3.4 will describe a protocol that offers immediate
recovery with wait-free read and write operations, but a revoke operation that is not wait-free. In other
words,  a  storage  object  can  provide  wait-free  read  and  write  operations,  with  the  liveness  penalty  of 
immediate  recovery  only  experienced  when  a  revoke  operation  makes  the  penalty  unavoidable. 

3.4  A  Wait-free  Storage  Protocol  with  Immediate  Recovery 

This  section  presents  a  protocol  that  offers  immediate  recovery  with  wait-free  read  and  write  op¬ 
erations  (revoke  operations  are  not  wait-free).  The  protocol  reads  and  writes  from  a  single  block. 
An  array  of  blocks  can  be  formed  by  running  multiple  instances  of  the  protocol  in  parallel.  The 
protocol requires 3f + 1 total servers to tolerate f Byzantine faulty servers, and it tolerates any
number of Byzantine faulty clients. The network is assumed to be asynchronous. The client retries
sending requests across the network until it receives enough responses to continue, but this retry
pattern is not shown. The client has access to a replicated state machine subprotocol, such as PBFT
or Zyzzyva [24, 55], which will be used for the revoke operation.

The  protocol  is  comprised  of  three  operations:  Write,  Read,  and  Revoke.  The  protocol  is  based 
on the timestamp paradigm found in many prior protocols (e.g., [44, 48, 66]), and is similar to that of
Liskov and Rodrigues [66]. For a write, a client chooses a timestamp higher than the most recently
completed  write,  and  is  said  to  write  its  block  at  that  timestamp.  A  reader  finds  the  block  with  the 
highest  timestamp.  If  two  blocks  share  the  same  timestamp,  the  tie  is  broken  deterministically.  In 
particular,  in  this  protocol  the  block  value  that  would  represent  the  greater  binary  number  is  chosen. 
If  the  two  block  values  are  the  same,  either  block  is  returned. 
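
The ordering on (ts, block) pairs described above can be sketched as follows. This is an illustration only: the names pair_key and greatest_pair are hypothetical, and reading the block as an unsigned big-endian integer is an assumption, since the text states only that the block representing the greater binary number is chosen.

```python
def pair_key(pair):
    """Sort key for a (ts, block) pair: timestamp first, then block value."""
    ts, block = pair
    # Assumption: equal-length blocks are compared as big-endian unsigned integers.
    return (ts, int.from_bytes(block, "big"))


def greatest_pair(pairs):
    """Return the pair a reader would choose among the candidates."""
    return max(pairs, key=pair_key)
```

For example, greatest_pair([(3, b"\x01"), (3, b"\x02"), (2, b"\xff")]) selects (3, b"\x02"): the two pairs at timestamp 3 tie on ts, and the tie is broken by the block value.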

The protocol is designed for simplicity, so as to prove that storage protocols can offer immediate
recovery and wait-free read and write operations. The protocol was not designed for efficiency, and
there are a number of optimizations that are not considered. Because the protocol is so similar to
prior protocols (e.g., the protocol of Liskov and Rodrigues [66]), the protocol presentation is kept as
informal as possible.

3.4.1  Write 

The write subprotocol comprises three rounds. First, the client finds the highest ts. Second,
the client sends a prepare request, which includes the block value. Third, the client sends a commit
request, which includes proof that prepare requests were received by 2f + 1 servers. In the first
round, the client requests the highest ts value from each server. Each correct server returns the
greatest ts of any commit request processed. After receiving 2f + 1 responses, the client chooses a
ts one greater than the greatest ts value returned.

The  client  then  sends  a  signed  prepare  request  to  all  servers,  which  consists  of  the  chosen  ts  and 
the  block.  Each  correct  server  stores  the  signed  prepare  request  if  it  has  the  greatest  (ts,  block)  pair 
from this client so far, and returns a signed response, which consists of a tag denoting that this is
a prepare response, the ts, and the block (in practice, only a collision-resistant hash of the block is
required).  If  a  client’s  write  privileges  have  been  revoked,  a  server  may  return  a  FAILURE  message, 
which  includes  the  signed  revocation  request,  in  which  case  the  write  fails,  the  write  protocol  returns 
FAILURE  to  the  application,  and  a  correct  client  no  longer  issues  write  operations. 

After receiving 2f + 1 prepare responses, the client forms a prepare certificate, which is just the
pair (ts, block) and the 2f + 1 signed prepare responses. The client sends a commit request to all
servers, which consists of the prepare certificate. Each correct server stores the prepare certificate as
the new most recent write if the pair (ts, block) is the greatest received from any client so far, even
if the client has had its write privileges revoked. A correct server then sends an acknowledgment to
the client. Upon receiving 2f + 1 acknowledgments of the commit request, the write is complete,
and the write protocol returns SUCCESS to the application.

Liveness: The write subprotocol is wait-free. Each round consists of the client sending a request
to 3f + 1 servers and waiting for responses from 2f + 1 servers. Eventually, the 2f + 1 correct
servers will return some response, and the client will continue.
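
The counting behind this liveness argument can be written out explicitly, together with the standard quorum-intersection bound for response sets of this size. The symbols n, Q_1, and Q_2 are introduced here for illustration and are not part of the original presentation.

```latex
\begin{align*}
n &= 3f + 1 && \text{total servers}\\
n - f &= 2f + 1 && \text{minimum number of correct servers}\\
|Q_1 \cap Q_2| &\ge |Q_1| + |Q_2| - n = 2(2f + 1) - (3f + 1) = f + 1 && \text{for response sets } Q_1, Q_2 \text{ of size } 2f + 1
\end{align*}
```

Because at most f servers are faulty, waiting for 2f + 1 responses never blocks indefinitely, and any two sets of 2f + 1 responding servers share at least one correct server.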

3.4.2  Read 

The read subprotocol comprises two rounds. First, the client asks each server for the prepare
certificate provided by the commit request with the highest timestamp (or some default block if
no block has been committed). After receiving 2f + 1 prepare certificates, the client chooses the
prepare certificate with the highest timestamp. Second, the client writes back a commit request
consisting of the prepare certificate to each server and waits for 2f + 1 acknowledgments. The read
subprotocol then returns the block from that prepare certificate to the application.

Liveness:  The  read  subprotocol  is  wait-free  for  the  same  reasons  that  the  write  subprotocol  is 
wait-free. 
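
The write and read subprotocols above can be sketched as a small in-memory model. This is a simplified sketch under stated assumptions, not the dissertation's implementation: all servers are correct and respond in order, the asynchronous network and retries are not modeled, signatures are replaced by plain tuples carrying the server id, and every name (Server, write, read, key, QUORUM) is hypothetical.

```python
from dataclasses import dataclass, field

F = 1                 # number of faulty servers tolerated
N = 3 * F + 1         # total servers
QUORUM = 2 * F + 1    # responses a client waits for in each round


def key(pair):
    """Order (ts, block) pairs: timestamp first, then block as a binary number."""
    ts, block = pair
    return (ts, int.from_bytes(block, "big"))


@dataclass
class Server:
    sid: int
    committed: tuple = (0, b"")                    # most recent committed pair
    certificate: tuple = ()                        # prepare certificate for it
    prepared: dict = field(default_factory=dict)   # client -> greatest prepared pair
    revoked: set = field(default_factory=set)      # clients with revoked privileges

    def prepare(self, client, pair):
        if client in self.revoked:
            return ("FAILURE", self.sid)
        if client not in self.prepared or key(pair) > key(self.prepared[client]):
            self.prepared[client] = pair
        # Stand-in for a signed prepare response (no real signature here).
        return ("PREPARE", self.sid, pair)

    def commit(self, pair, certificate):
        signers = {r[1] for r in certificate if r[0] == "PREPARE" and r[2] == pair}
        if len(signers) < QUORUM:
            return None                            # not a valid prepare certificate
        if key(pair) > key(self.committed):
            self.committed, self.certificate = pair, tuple(certificate)
        return ("ACK", self.sid)


def write(client, block, servers):
    # Round 1: learn the greatest committed ts from 2f + 1 servers.
    ts = max(s.committed[0] for s in servers[:QUORUM]) + 1
    # Round 2: send the prepare request and collect responses.
    responses = [s.prepare(client, (ts, block)) for s in servers]
    if any(r[0] == "FAILURE" for r in responses):
        return "FAILURE"
    certificate = tuple(responses[:QUORUM])
    # Round 3: send the commit request with the prepare certificate.
    acks = [s.commit((ts, block), certificate) for s in servers]
    return "SUCCESS" if sum(a is not None for a in acks) >= QUORUM else "FAILURE"


def read(servers):
    # Round 1: gather the latest prepare certificate from 2f + 1 servers.
    replies = [(s.committed, s.certificate) for s in servers[:QUORUM]]
    pair, certificate = max(replies, key=lambda r: key(r[0]))
    if pair[0] == 0:
        return pair[1]   # default block; assumed here to need no write back
    # Round 2: write the certificate back and wait for 2f + 1 acknowledgments.
    acks = [s.commit(pair, certificate) for s in servers]
    assert sum(a is not None for a in acks) >= QUORUM
    return pair[1]
```

In this model, a read performed after one server has fallen behind brings that server up to date through the write back, which is the property the later linearizability argument relies on.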

3.4.3  Revoke 

The revoke subprotocol has three rounds, in which a client called the revoker revokes write
privileges for another client. First, the revoker sends each server a signed revocation request asking
for the prepare request with the greatest (ts, block) pair for the client being revoked (or a default block).
Each server promises to not accept future prepare requests from this client (though a client can still
complete a commit request if it already has a prepare certificate), and each server stores the signed
revocation request. The revoker gathers 2f + 1 such prepare requests and such promises not to allow
further prepares.

Second, the revoker sends these 2f + 1 non-matching prepare requests to the replicated state
machine  subprotocol.  If  this  is  the  first  revocation  request  for  the  client  being  revoked,  the  replicated 
state  machine  protocol  returns  the  prepare  request  with  the  greatest  (ts,  block)  pair,  along  with  a 
signed  message  that  allows  the  prepare  request  to  be  used  as  a  prepare  certificate.  If  this  is  not  the 
first  revocation  request  for  the  client  being  revoked,  the  replicated  state  machine  protocol  returns 
the  prepare  request  it  sent  for  the  first  revocation. 

Third, the revoker uses the signed message and the prepare request to write back a commit
request to each server. After 2f + 1 acknowledgments, the revoker is done and returns SUCCESS to
the application.

An agreement subprotocol, as implied by Theorem 3.3.5, is provided in the second round by the
replicated state machine. The agreement subprotocol ensures that a unique prepare request is the
final prepare request committed during write back. Otherwise, two concurrent revoke operations
could write back different values, which would allow a write for the client being revoked to appear
after the first revoke operation completes.

Liveness: The revoke subprotocol inherits the liveness properties of the underlying replicated
state machine subprotocol. The non-state-machine requests consist of the revoker sending a request
to 3f + 1 servers and waiting for 2f + 1 responses, as in the write and read subprotocols, so the
revoker will continue once the 2f + 1 correct servers respond.

3.4.4  Linearizability  and  Immediate  Recovery 

To  see  why  this  protocol  is  linearizable  from  the  perspective  of  correct  clients,  consider  the  sequence 
of  read  and  write  operations  from  correct  clients  ordered  by  (ts,  block)  pairs,  where  operations 
with  the  same  (ts,  block)  value  are  ordered  by  real-time  precedence,  with  writes  preceding  reads 
whenever  possible.  Insert  revoke  operations  by  correct  revoking  clients  as  late  as  possible  while 
respecting  real-time  precedence,  but  before  any  subsequent  FAILURE  for  a  write  by  a  correct  client 
being  revoked.  Finally,  if  the  value  of  block  for  the  write  that  immediately  precedes  a  read  does 
not  match  the  value  of  block  read,  insert  a  write  by  a  faulty  client  that  is  not  yet  revoked  for  the 
(ts,  block)  pair  that  is  read.  Insert  the  write  immediately  after  the  preceding  read  or  write  for  a 
different  (ts,  block)  pair,  but  before  any  revoke  for  the  faulty  client.  Call  this  sequence  of  operations 
sequential  history  S. 

It  is  always  possible  to  insert  a  write  for  a  faulty  client  as  described  above.  Suppose  a  correct 
client reads a (ts, block) pair, but that the read response precedes the invocation of any write by a
correct  client  for  that  pair.  The  read  returns  a  signed  prepare  certificate,  which  means  that  a  signed 
prepare  request  must  have  been  generated  for  some  client  for  the  pair  (ts,  block).  If  the  prepare 
request  was  not  signed  by  a  correct  client,  it  must  correspond  to  a  faulty  client. 

Note that if a read or write precedes another read or write, the (ts, block) pair will not decrease
because write back ensures that the (ts, block) pair committed to at least 2f + 1 servers is at least
as great as that of the first read or write. Thus, the subsequence of S consisting of each event for
any correct client is the same as the subsequence of events for that correct client in history H,
the sequence of events from all correct clients. Each read, write, and revoke is present, and their
relative ordering is consistent with real-time precedence.

By construction, the write that immediately precedes a read writes a matching value of block.
Furthermore, if a revoke for a correct client precedes a write by that client, one of the 2f + 1 servers
that promised to reject future prepare requests will return FAILURE, so the write will return FAILURE.
Also, if a write by a correct client returns FAILURE, then the client received a revocation request that
must have been signed by a client prior to the write invocation. Thus, S is a legal sequential history.
Also, each response that precedes some invocation in H precedes that invocation in S. Thus, the
protocol satisfies the natural Byzantine extension to linearizability.

Immediate  recovery:  Immediate  recovery  follows  from  Definition  3.3.4,  the  definition  of  the 
revoke  operation  in  this  protocol,  and  because  the  read,  write,  and  revoke  protocol  described  in  this 
section satisfies the natural Byzantine extension to linearizability.

3.5 Conclusion

Distributed  protocols  use  correctness  arguments  to  ensure  that  applications  experience  the  expected 
operational  semantics.  Unfortunately,  correctness  arguments  are  challenging  in  the  presence  of 
faulty  clients  because  traditional  notions  of  correctness,  such  as  linearizability,  assume  that  all  nodes 
are  correct.  This  chapter  presented  the  natural  extension  to  linearizability  in  environments  with 
faulty  clients.  It  also  proposed  immediate  recovery,  a  new  correctness  criterion  that  ensures  that 
faulty  clients  cannot  affect  the  state  of  a  storage  system  after  being  removed.  Furthermore,  the 
chapter  demonstrated  that  a  Byzantine  fault-tolerant  storage  protocol  can  provide  wait-free  reads 
and  writes  yet  still  ensure  immediate  recovery. 


Chapter  4 


Low-Overhead  Byzantine  Fault-Tolerant 
Storage 


Distributed  storage  systems  must  tolerate  faults  other  than  crashes  as  such  systems  grow  in  size 
and  importance.  Protocols  that  can  tolerate  arbitrarily  faulty  behavior  by  components  of  the  system 
are  said  to  be  Byzantine  fault-tolerant  [65].  Most  Byzantine  fault-tolerant  protocols  are  used  to 
implement  replicated  state  machines,  in  which  each  request  is  sent  to  a  server  replica  and  each 
non-faulty  replica  sends  a  response.  Replication  does  not  introduce  unreasonable  overhead  when 
requests  and  responses  are  small  relative  to  the  processing  involved,  but  for  distributed  storage,  large 
blocks  of  data  are  often  transferred  as  a  part  of  an  otherwise  simple  read  or  write  request.  Though 
a  single  server  can  return  the  block  for  a  read  request,  the  block  must  be  sent  to  each  server  for  a 
write  request. 

A storage protocol can reduce the amount of data that must be sent to each server by using an
m-of-n erasure code [84, 85]. Each block is encoded into n fragments such that any m fragments
can  be  used  to  decode  the  block.  Unfortunately,  existing  protocols  that  use  erasure  codes  struggle 
with  tolerating  Byzantine  faulty  clients.  Such  clients  can  write  inconsistently  encoded  fragments, 
such  that  different  subsets  of  fragments  decode  into  different  blocks  of  data. 
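
The inconsistency problem can be made concrete with a toy code. The sketch below uses a 2-of-3 XOR parity code chosen purely for illustration (it is not the code used in this dissertation), assumes blocks of even length, and uses hypothetical function names.

```python
def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block):
    """Toy 2-of-3 erasure code: two data halves plus their XOR parity."""
    half = len(block) // 2
    d0, d1 = block[:half], block[half:]
    return [d0, d1, xor(d0, d1)]


def decode(frags):
    """Decode a block from any two fragments, given as {index: fragment}."""
    if 0 in frags and 1 in frags:
        return frags[0] + frags[1]
    if 0 in frags:
        return frags[0] + xor(frags[0], frags[2])
    return xor(frags[1], frags[2]) + frags[1]


# A correct client: every pair of fragments decodes to the same block.
good = encode(b"abcd")

# A faulty client: the parity fragment does not match the data fragments,
# so different subsets of fragments decode to different blocks.
bad = [good[0], good[1], b"\x00\x00"]
```

With the correctly encoded fragments, every two-fragment subset decodes to b"abcd". With the faulty fragments, decode({0: bad[0], 1: bad[1]}) yields b"abcd" while decode({0: bad[0], 2: bad[2]}) yields b"abab", so two readers can observe two different blocks for the same write.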

Existing  protocols  that  use  erasure  codes  either  provide  each  server  with  the  entire  block  of  data 
or  introduce  expensive  techniques  to  ensure  that  fragments  decode  into  a  unique  block.  In  the  first 
approach,  the  block  is  erasure-coded  at  the  server  [18],  which  saves  disk  bandwidth  and  capacity  but 
not  network  bandwidth.  The  second  approach  [44]  saves  network  bandwidth  but  requires  additional 
servers and a relatively expensive verification procedure for read operations. Furthermore, in this
approach,  all  writes  must  be  versioned  because  clients  may  need  to  read  several  versions  of  a  block 
before  read  verification  succeeds,  and  a  separate  garbage  collection  protocol  must  be  run  to  free  old 
versions  of  a  block  [3]. 

This chapter takes a different approach. It proposes novel mechanisms to optimize for the common
case when faults and concurrency are rare. These optimizations minimize the number of rounds
of communication, the amount of computation, and the number of servers that must be available at
any time. Also, this chapter employs the homomorphic fingerprinting primitive developed in Chapter 2
to ensure that a block is encoded correctly. Homomorphic fingerprinting eliminates the need
for versioned storage and a separate garbage collection protocol, and it minimizes the verification
performed during read operations.

Analysis  and  measurements  of  a  distributed  block  storage  prototype  demonstrate  throughput 
close  to  that  of  a  system  that  tolerates  only  benign  crash  faults  and  well  beyond  the  throughput 
realized  by  competing  Byzantine  fault-tolerant  approaches.  The  protocol  achieves  within  10%  of 
the  throughput  of  an  ideal  crash  fault-tolerant  system  when  reading  or  writing  64  kB  blocks  of  data 
and  tolerating  up  to  6  faulty  servers  and  any  number  of  faulty  clients.  Across  a  range  of  values  for 
the  number  of  faulty  servers  tolerated,  the  protocol  outperforms  competing  approaches  during  write 
or  read  operations  by  more  than  a  factor  of  two. 


4.1  Background 

Reliability has long been a primary requirement of storage systems. Thus, most non-personal storage
servers (whether disk arrays or file servers) are designed to tolerate faults of at least some components.
Until recently, tolerance of any single component fault was considered sufficient by many,
but larger systems have pushed developers toward tolerating multiple component faults [31, 41].

Faults are tolerated via redundancy. In the case of storage systems, data is stored redundantly
across multiple disk drives in order to tolerate faults in a subset of them. Two common forms of
data redundancy are replication and m-of-n erasure coding, in which a block of data is encoded
into n fragments such that any m can be used to reconstruct the original block. For disk arrays, the
trade-off between the two has been extensively explored for two-way mirroring (i.e., two replicas)
versus RAID-5 (i.e., (n - 1)-of-n erasure coding) or RAID-6 ((n - 2)-of-n erasure coding). Mirroring
performs well when the number of faults tolerated is small or when writes are small; RAID-5 and
RAID-6 perform better for large writes; and all three perform well for reads [27, 106]. The Google
File  System  (GFS)  [41],  for  example,  uses  replication  for  a  mostly-read  workload  and  by  default 
tolerates  two  faults. 

For distributed storage, as the number of faults tolerated grows beyond two or three, erasure
coding provides much better write bandwidth [100, 105]. A few distributed storage systems support
erasure coding. For example, Zebra [47], xFS [8], and PanFS [77] support parity-based protection
of data striped across multiple servers. FAB [91], Ursa Minor [1], and RepStore [108] support more
general m-of-n erasure coding.

4.1.1  Beyond  Crash  Faults 

A  common  assumption  is  that  tolerance  of  crash  faults  is  sufficient  for  distributed  storage  systems, 
but  an  examination  of  a  modern  centralized  storage  server  shows  this  assumption  to  be  invalid.  Most 
such  servers  integrate  various  checksum  and  scrubbing  mechanisms  to  detect  non-crash  faults.  One 
common  approach  is  to  store  a  checksum  with  every  block  of  data  and  then  verify  that  checksum 
upon  every  read  (e.g.,  ZFS  [14]  and  GFS  [41]).  This  checksum  can  be  used  to  detect  problems  such 
as  when  the  device  driver  silently  corrupts  data  [14,  41]  or  when  the  disk  drive  writes  data  to  the 
wrong  physical  location  [1,  14]. 

For  example,  if  a  disk  drive  overwrites  the  wrong  physical  block,  a  checksum  may  be  able  to 
detect this corruption, but only if the checksum is not overwritten as well. To prevent the checksum
from being incorrect along with the data, the checksum is often stored separately (e.g., with the
metadata  [14,  41]). 

In  short,  various  mechanisms  are  applied  to  detect  and  recover  from  non-crash  faults  in  modern 
storage  systems.  These  mechanisms  are  chosen  and  combined  in  an  ad  hoc  manner  based  on  the 
collective experience of the organizations that design storage systems. Such mechanisms are inherently
ad hoc because no fault model can describe just the types of faults that must be handled in
distributed  storage;  as  systems  change,  the  types  of  faults  change.  In  the  absence  of  a  more  specific 
fault  model,  general  Byzantine  fault  tolerance  [65]  can  be  used  to  cover  all  possibilities. 

4.1.2  The  Cost  of  Byzantine  Fault  Tolerance 

Byzantine fault-tolerant m-of-n erasure-coded storage protocols require at least m + 2f servers to
tolerate f faulty servers. Requiring this many servers is less imposing than it sounds; modern
non-Byzantine fault-tolerant erasure-coded storage arrays already use a similar number of disk drives.
A typical storage array will have several primary disk drives that store unencoded data (e.g., m = 5)
and a few parity drives for redundancy (e.g., f = 2). Beyond these drives, however, an array will
often include a pool of hot spares with f or more drives (sometimes shared with neighboring arrays).
The reason for this setup is that once a drive fails or becomes otherwise unresponsive, the storage
array must either halt or decrease the number of faults that it tolerates unless it replaces that drive
with a hot spare. This setup is similar to providing m + f responsive drives to a Byzantine
fault-tolerant protocol but making an additional f drives available as needed (for a similar approach, see
Rodrigues et al. [89]). In other words, the specter of additional hardware should not scare developers
away from Byzantine fault tolerance.
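
As a worked instance of this comparison, take the example values from the text, m = 5 and f = 2, and assume a hot-spare pool of exactly f drives. The labels n_BFT and n_array are introduced here for illustration only.

```latex
\begin{align*}
n_{\text{BFT}} &= m + 2f = 5 + 2 \cdot 2 = 9\\
n_{\text{array}} &= \underbrace{m}_{\text{data}} + \underbrace{f}_{\text{parity}} + \underbrace{f}_{\text{hot spares}} = 5 + 2 + 2 = 9
\end{align*}
```

Under these assumptions both configurations use nine devices; a larger spare pool would only increase the count for the conventional array.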

Byzantine fault-tolerant protocols often require additional computational overhead. For example,
the protocol proposed in this chapter requires data to be cryptographically hashed for each write
and read operation. This overhead, however, is less significant than it appears for two reasons. First,
data must be hashed anyway if it is to be authenticated when sent over the network. Of course,
data must be structured properly to use a hash for both authentication and fault tolerance. Second,
many modern file systems hash data anyway. For example, ZFS supports hashing all data with
SHA-256 [99], and EMC's Centera hashes all data with either MD5 or a concatenation of MD5 and
SHA-256 to provide content-addressed storage [82].

4.1.3  Byzantine  Fault-Tolerant  Storage 

Many Byzantine fault-tolerant protocols are used to implement replicated state machines.
Implementations of recent protocols can be quite efficient due to several optimizations. For example,
Castro and Liskov eliminate public-key signatures from the common case by replacing signatures
with message authentication codes (MACs) and lazily retrieving signatures in cases of failure [23].
Abd-El-Malek et al. use aggressive optimistic techniques and quorums to scale as the number of
faults tolerated is increased, but their protocol requires 5f + 1 servers to tolerate f faulty servers [2].
Cowling et al. use a hybrid of these two protocols to achieve good performance with only 3f + 1
servers [32]. Kotla and Dahlin further improve performance by using application-specific
information to allow parallelism [57]. Though a Byzantine fault-tolerant replicated state machine protocol
can be used to implement a block storage protocol, doing so requires writing data to at least f + 1
replicas.

When writing large blocks of data and tolerating multiple faults, a Byzantine fault-tolerant storage
protocol should provide erasure-coded fragments to each server to minimize the bandwidth
overhead of redundancy. Writing erasure-coded fragments has been difficult to achieve because
servers must ensure that a block is encoded correctly without seeing the entire block. Goodson et al.
introduced PASIS, a Byzantine fault-tolerant erasure-coded block storage protocol [44]. In PASIS,
servers do not verify that a block is correctly encoded during write operations. Instead, clients verify
during read operations that a block is correctly encoded.

This technique avoids the problem of verifying erasure-coded fragments but introduces a few
new ones. First, fragments must be kept in versioned storage [97] because clients may need to
read several versions of a block before finding a version that is encoded correctly. Second, the read
verification process is computationally expensive. Third, PASIS requires 4f + 1 servers to tolerate f
faults. Fourth, a separate garbage collection protocol must eventually be run to free old versions of a
block. A lazy verification protocol, which also performs garbage collection, was proposed to reduce
the impact of read verification by performing it in the background [3], but this protocol consumes
significant bandwidth.

Cachin and Tessaro introduced AVID [18], an asynchronous verifiable information dispersal
protocol, which they used to build a Byzantine fault-tolerant block storage protocol that requires
only 3f + 1 servers to tolerate f faults [19]. In AVID, a client sends each server an erasure-coded
fragment. Each server sends its fragment to all other servers such that each server can verify that
the block is encoded correctly. This all-to-all communication, however, consumes slightly more
bandwidth in the common case than a typical replication protocol. Chapter 2 provides a protocol
that reduces this overhead but still requires all-to-all communication and the encoding and hashing
of 3f + 1 fragments (Section 2.3). These shortcomings are addressed in Section 4.2.1.

Many of the problems in PASIS are caused by the need to handle Byzantine faulty clients. Faulty
clients should be tolerated in a Byzantine fault-tolerant storage system to prevent such clients from
forcing the system into an inconsistent state. For example, though a faulty client can corrupt blocks
for which it has write permissions, it must not be allowed to write a value that is read as two
different blocks by two different correct clients; otherwise, a faulty client at a bank could, for example,
provide one account balance to the auditors but another to the ATM. Liskov and Rodrigues [66] propose
that servers provide public-key signatures to vouch for the state of the system. This technique can
be used to tolerate Byzantine faulty clients in a quorum system. In the next section, this technique
is adapted to use only MACs and pseudo-random nonce values in a PASIS-like protocol.


4.2  The  FP  Protocol 

This  section  describes  the  block  storage  protocol,  which  is  called  the  FP  protocol  to  reflect  its  usage 
of  homomorphic  fingerprinting.  A  separate  instance  is  executed  for  each  block,  so  this  section  does 
not  discuss  block  numbers.  Section  4.2.1  describes  the  design  of  the  protocol,  including  how  it 
builds on prior protocols. Section 4.2.2 describes the system model. Section 4.2.3 provides pseudocode
for write and read operations. Section 4.2.4 discusses liveness and linearizability.

4.2.1 Design

Consider the following replication-based protocol [66], which requires 3f + 1 servers to tolerate f
faults. To write a block, a client hashes the block and sends the hash to all servers (the prepare
phase). The servers respond with a signed message containing the hash and a logical timestamp,
which is always greater than any timestamp that the server has seen. If there are not at least 2f + 1
matching timestamps, the client requests that each server sign a new message using the greatest
timestamp found. The client commits the write by sending the entire block along with 2f + 1 signed
messages with matching timestamps to each server. The server verifies that the signatures are valid
and that the hash of the block matches the hash in the signed message. Because this protocol uses
public-key signatures, the client can verify the responses from the prepare phase before it attempts
to commit the write.

To read the block, the client queries all servers for their most recent timestamp and the 2f + 1
signatures generated in the prepare phase. The client reads the block with the greatest timestamp
and 2f + 1 correctly signed messages. The signatures allow the client to verify that 2f + 1
servers provided signatures in the prepare phase, which ensures that some client invoked a write of
this block at this timestamp (at least one of these servers is correct and, hence, would only provide
signatures to a client in the prepare phase). To ensure that other clients see this block, the client
writes it back to any servers with older timestamps.

Sending the entire block causes overhead that could be eliminated by sending erasure-coded
fragments instead. Writing erasure-coded fragments, however, poses a problem, in that servers
can no longer agree on what is being written. A faulty or malicious client can write fragments
that decode to different blocks depending upon which subset of fragments is decoded. PASIS [44]
uses a cross-checksum [43], which is a set of hashes of the erasure-coded fragments, to detect
such inconsistencies. To write a block, a client requests the most recent logical timestamp from
all servers in the first round. In the second round, the client sends each server its fragment, the
cross-checksum, and the greatest timestamp found. Unfortunately, PASIS requires 4f + 1 servers to
tolerate f faults, so 4f + 1 fragments must be encoded and hashed, which is a significant expense.
To read a block, the client reads fragments and cross-checksums from each server, starting with
the most recent timestamp, until it finds m fragments whose hashes correspond to their respective
locations in the cross-checksum. From these fragments, the client decodes the block, re-encodes
n fragments, and recomputes the cross-checksum. If this cross-checksum does not match the one
provided by the servers, then the write operation for that timestamp was invalid and the client must
try reading fragments at an earlier timestamp. If cross-checksums match, the client writes fragments
back to servers as needed.
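
The read-time verification described above can be sketched for a single timestamp as follows. This is a minimal illustration, not the PASIS implementation: the toy 2-of-3 XOR parity code, the use of SHA-256 for the cross-checksum, the even block length, and all function names are assumptions made for the example.

```python
import hashlib


def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block):
    """Toy 2-of-3 erasure code: two data halves plus their XOR parity."""
    half = len(block) // 2
    d0, d1 = block[:half], block[half:]
    return [d0, d1, xor(d0, d1)]


def decode(frags):
    """Decode a block from any two fragments, given as {index: fragment}."""
    if 0 in frags and 1 in frags:
        return frags[0] + frags[1]
    if 0 in frags:
        return frags[0] + xor(frags[0], frags[2])
    return xor(frags[1], frags[2]) + frags[1]


def cross_checksum(frags):
    """One hash per fragment, in fragment order."""
    return [hashlib.sha256(f).digest() for f in frags]


def verify_read(frags, checksum):
    """Return the block if the write was consistently encoded, else None."""
    # Step 1: each fragment must match its slot in the cross-checksum.
    if any(hashlib.sha256(f).digest() != checksum[i] for i, f in frags.items()):
        return None
    # Step 2: decode, re-encode all n fragments, and recompute the cross-checksum.
    block = decode(frags)
    if cross_checksum(encode(block)) != checksum:
        return None   # invalid write; a reader would try an earlier timestamp
    return block
```

A reader that receives None for one timestamp would repeat the check at the next earlier timestamp, which is why versioned storage is needed in this design.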


Improvements 

The replication-based protocol requires only 3f + 1 servers but relies on public-key signatures and
replication. PASIS improves write bandwidth but has a number of drawbacks, as discussed in
Section 4.1.3. The protocol proposed in this paper improves on these approaches with the following
four techniques.

No public-key cryptography: As in the replication-based protocol [66], the response to a
prepare request in the protocol includes an authenticated timestamp and checksum that allows the
client to progress to the commit phase once enough timestamps match. Castro and Liskov [23] avoid
using signatures for authentication by using message authentication codes (MACs) in the common
case but lazily retrieving signatures when needed. This lazy retrieval technique does not work for
the Liskov and Rodrigues protocol, however, because signatures are stored for later use to vouch for
the state of the system [66, Section 3.3.2].

Instead, the protocol avoids signatures altogether and relies entirely on MACs and random nonce values.
All servers share pairwise MAC keys (clients do not create or verify MACs). Each server provides
MACs to the client in the prepare phase, which the client sends in the commit phase to prove that
enough servers successfully completed prepare requests at a given timestamp. Of course, a faulty
server may provide faulty MACs during the prepare phase, or it may reject valid MACs during the
commit phase. To recover from this, a client may need to gather more MACs from the prepare phase
after it has entered the commit phase, but a commit eventually completes.

The  replication-based  protocol  also  uses  signatures  to  allow  servers  to  prove  to  a  reader  that 
a  block  was  written  by  some  client;  that  is,  to  prevent  a  server  from  returning  fabricated  data. 
Instead  of  signatures,  each  server  provides  a  pseudo-random  nonce  value  in  the  prepare  phase  of  the 
protocol.  The  client  aggregates  these  values  and  provides  them  to  each  server  in  the  commit  phase. 
During  a  read  operation,  a  server  provides  the  client  with  these  nonce  values  to  prove  that  some 
client  invoked  a  write  at  a  specific  timestamp,  as  will  be  described  in  Section  4.2.3  and  Lemma  4.2.5 
of  Section  4.2.4. 

Early write: Before committing a write, a correct server must ensure that enough other correct
servers have fragments for this write, such that a reader will be able to reconstruct the block. The
replication-based protocol does not face this problem because each server stores the entire block.
The protocol proposed in this paper could solve this problem with another round of communication
for servers to confirm receipt of a fragment. Instead, unlike in previous protocols,
clients send erasure-coded fragments in the first round of the prepare phase, which saves a round of
communication. With this approach, a faulty client may send a fragment in the first phase without
committing the write. As in PASIS [3], a server may limit this by rejecting a write from a client with
too many uncommitted writes.

Partial encoding: PASIS encodes and hashes fragments for all n servers. Encoding this many
fragments is wasteful because f servers may not be involved in a write operation. Instead, the
protocol encodes and hashes fragments for only the first n - f servers, which lowers the computational
overhead. This technique is called partial encoding because the block is only partially encoded for
most write operations. The computational savings are significant: for m = f + 1, the protocol
encodes only 2f + 1 fragments, which is the same number encoded by a non-Byzantine fault-tolerant
erasure-coded protocol. Many m-of-n erasure codes encode the first m fragments by dividing a
block into m fragments (such codes are said to be systematic), which takes little if any computation.
Hence, encoding 2f + 1 fragments requires computing f values, whereas encoding 4f + 1 fragments
(as in PASIS) requires computing 3f values.
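
The arithmetic behind this comparison can be checked with a few lines of Python. The sketch assumes a systematic code, so the first m fragments cost nothing to produce, and it fixes m = f + 1 as in the text; the function names are made up for the example.

```python
def computed_fragments(m, encoded):
    """Fragments that need real computation when `encoded` fragments are produced
    with a systematic m-of-n code (the first m are plain slices of the block)."""
    return max(encoded - m, 0)

def partial_encoding_cost(f):
    """Partial encoding with m = f + 1: only m + f = 2f + 1 fragments are encoded."""
    m = f + 1
    return computed_fragments(m, m + f)

def full_encoding_cost(f):
    """Encoding for all 4f + 1 servers with m = f + 1."""
    m = f + 1
    return computed_fragments(m, 4 * f + 1)

# Map each f to (values computed with partial encoding, values computed otherwise).
costs = {f: (partial_encoding_cost(f), full_encoding_cost(f)) for f in (1, 2, 3)}
```

For each f the pair is (f, 3f), matching the counts stated above.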

The drawback of this approach is that if one of the first m + f servers is non-responsive or faulty,
the client may need to send the entire block to convince another server that its fragment corresponds
to the checksum. This procedure is expensive: not only does it consume extra bandwidth, but the
server must verify the block against the checksum. To verify a block, the server encodes the block
into m + f fragments, hashes each fragment, and compares these hashes to the checksum provided
by the client. If the hashes match the checksum, the server encodes its fragment from the block.
Fortunately, the first m + f servers should rarely be non-responsive or faulty.

Distributed verification of erasure-coded data: One problem in PASIS is that each server
knows only the cross-checksum and its fragment, and so it is difficult for a server to verify that
its fragment together with the corresponding fragments held by other servers form a valid erasure
coding of a unique block. Chapter 2 solves this problem using a fingerprinted cross-checksum.
The fingerprinted cross-checksum includes a cross-checksum, as used in PASIS, along with a set of
homomorphic fingerprints of the first m fragments of the block. The fingerprints are homomorphic
in that the fingerprint of the erasure coding of a set of fragments is equal to the erasure coding of
the fingerprints of those fragments. The overhead of computing homomorphic fingerprints is small
compared to the cryptographic hashing for the cross-checksum.

The i-th fragment is said to be consistent with a fingerprinted cross-checksum if its hash matches
the i-th index in the cross-checksum and its fingerprint matches the i-th erasure coding of the
homomorphic fingerprints. Thus, a server can determine if a fragment is consistent with a fingerprinted
cross-checksum without access to any other fragments. Furthermore, any two blocks decoded from
any two sets of m fragments that are consistent with the fingerprinted cross-checksum are identical
with all but negligible probability (Theorem 2.2.4). A server can check that a fragment is consistent
with a fingerprinted cross-checksum shared by other servers on commit, allowing it to overwrite
old fragments. Thus, only fragments that are in the process of being written must be versioned,
obviating the need for on-disk versioning. This technique also eliminates most of the computational
expense of validating the cross-checksum during a read operation.
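
The homomorphic property can be illustrated with a toy example in Python. The fingerprints actually used are defined in Chapter 2 and are not reproduced here; as a stand-in, this sketch assumes fragments are vectors over a prime field, uses a linear erasure code, and fingerprints a fragment by evaluating it as a polynomial at a point r. The modulus, the coefficient row, and all names are assumptions made for the illustration.

```python
P = 2**61 - 1  # prime modulus of the toy field (assumption)

def fingerprint(r, frag):
    """Evaluate the fragment, read as polynomial coefficients, at the point r."""
    acc = 0
    for sym in reversed(frag):  # Horner's rule
        acc = (acc * r + sym) % P
    return acc

def encode(coeffs, frags):
    """One coded fragment: a symbol-wise linear combination of the data fragments."""
    length = len(frags[0])
    return [sum(c * f[k] for c, f in zip(coeffs, frags)) % P for k in range(length)]

def encode_fps(coeffs, fps):
    """The same linear combination applied to fingerprints instead of fragments."""
    return sum(c * fp for c, fp in zip(coeffs, fps)) % P

data = [[3, 1, 4, 1], [5, 9, 2, 6]]  # m = 2 data fragments
parity_row = [1, 2]                  # coefficients of one parity fragment
r = 123456789                        # in the protocol, r comes from a hash of the cross-checksum

parity = encode(parity_row, data)
lhs = fingerprint(r, parity)                                      # fingerprint of the encoding
rhs = encode_fps(parity_row, [fingerprint(r, d) for d in data])   # encoding of the fingerprints
```

Because both the code and the fingerprint are linear over the same field, lhs equals rhs, so a server holding only the parity fragment and the m data fingerprints can check its fragment without seeing any other fragment.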

Protocol  Overview 

This  section  provides  an  overview  of  the  protocol.  The  protocol  provides  wait-free  [49]  writes  and 
obstruction-free  [50]  reads  of  constant-sized  blocks  while  tolerating  a  fixed  number  of  Byzantine 
servers  and  an  arbitrary  number  of  Byzantine  clients  in  an  asynchronous  environment.  Figure  4.2.1 
provides  an  outline  of  the  pseudo-code  for  both  write  operations  and  read  operations.  The  line 
numbers  in  Figure  4.2.1  match  those  of  Figures  4.2.2  and  4.2.3.  Figure  4.2.2  provides  detailed 
pseudo-code for a write operation and is described line-by-line in Section 4.2.3. Figure 4.2.3
provides detailed pseudo-code for a read operation and is described line-by-line in Section 4.2.3.

To write a block, a client encodes the block into m + f fragments, computes the fingerprinted
cross-checksum, and sends each server its fragment and the fingerprinted cross-checksum
(lines 1000-1106). Each server responds with a logical timestamp, a nonce, and, for each
server, a MAC of the timestamp, fingerprinted cross-checksum, and nonce (line 1405). If timestamps do
not match, the client requests new MACs at the greatest timestamp found (line 1114). Unlike with the
signatures in the Liskov and Rodrigues protocol, the client cannot tell whether these MACs are valid.

The client then commits this write by sending the timestamp, fingerprinted cross-checksum,
nonces, and MACs to each server (line 1128). A correct server may reject a commit with MACs from
faulty servers, or a faulty server may reject a commit with MACs from correct servers. Faults should
be uncommon, but when they occur, the client must contact another server. The client can either
try the commit at another server or it can send the entire block to another server in order to garner
another prepare response. A write operation returns after at most three rounds of communication
with correct servers. Because faults and concurrency are rare, the timestamps received in the first
round of prepare will often match, which allows most write operations to complete in only two
rounds.

c_write(B):
1000:  d_1, ..., d_{m+f} ← encode_{1,...,m+f}(B)  /* Partial encoding */
1001:  for (i ∈ {1, ..., m+f}) do fpcc.cc[i] ← hash(d_i)
1002:  for (i ∈ {1, ..., m}) do fpcc.fp[i] ← fingerprint(hash(fpcc.cc), d_i)
...:   for (i ∈ {1, ..., m+f}) do
1106:      Prepare[i] ← S_i.s_rpc_prepare_frag(d_i, ts, fpcc)
1405:      /* Server returns (ts, nonce, ⟨MAC_{i,j}((ts, fpcc, nonce))⟩_{1≤j≤n}) */
1114:      if (Prepare[i].ts ≠ ts) then  /* Retry prepare */
...:           ...  /* See lines 1109-1117 */
...:   for (i ∈ {1, ..., m+f}) do
1128:      S_i.s_rpc_commit(ts, fpcc, Prepare)

c_read():
...:   for (i ∈ {1, ..., 2f+1}) do
1605:      (ts, fpcc) ← S_i.s_rpc_find_timestamp()
...:   for (i ∈ {1, ..., m}) do
1620:      d_i ← S_i.s_rpc_read(ts, fpcc)
1817:  B ← decode(d_1, ..., d_m)
...:   /* Client verifies the consistency of block B in c_find_block */
1626:  return B

Figure 4.2.1: Pseudo-code outline. Line numbers match Figures 4.2.2 and 4.2.3.

To read a block, a client requests timestamps and fingerprinted cross-checksums from 2f + 1
servers (line 1605) and fragments from the first m servers (line 1620). If the m fragments are
consistent with the most recent fingerprinted cross-checksum, and if the client can determine that
some client invoked a write at this timestamp (using nonces as described in Section 4.2.3), a block is
decoded (line 1817) and returned (line 1626). (In Figure 4.2.3, the client verifies read responses and
decodes a block in c_find_block.) Most read operations return after one round of communication
with correct servers. If a concurrent write causes a fragment to be overwritten, however, the client
may be redirected to a later version of the block, as described in Section 4.2.3.


4.2.2  System  Model 

The point-to-point communication channel between each client and server is authenticated and
in-order, which can be achieved in practice with little overhead. Communication channels are reliable
but asynchronous, i.e., each message sent is eventually received, but there is no bound assumed
on message transmission delays. Reliability is assumed for presentational convenience only; the
protocol can be adapted to unreliable channels as discussed by Martin et al. [73, Section 4.3].

Up to f servers and an arbitrary number of clients are Byzantine faulty, behaving in an arbitrary
manner. An adversary can coordinate all faulty servers and clients. To bound the amount of storage
used to stage fragments for in-progress writes, there is a fixed upper bound on the number of clients
in the system and on the number of prepare requests from each client that are not followed by
subsequent commits.

It is assumed that there is a negligible probability that a MAC can be forged or that a hash
collision or preimage can be found. The fingerprinted cross-checksum requires that the hash function
acts as a random oracle [12]. All servers share pairwise MAC keys. The value labeled nonce must
not be disclosed to parties except as prescribed by the protocol, in order to prove Lemma 4.2.5 in
Section 4.2.4. This value is small and can be encrypted at little cost.

The protocol tolerates f Byzantine faulty servers and any number of faulty clients given an
m-of-n erasure code and n = m + 2f servers, where m ≥ f + 1. As in PASIS, the protocol may be
deployed with m > f + 1 to achieve higher bandwidth for fixed f.


4.2.3  Detailed  Pseudo-code 

This section provides detailed pseudo-code for write and read operations. Pseudo-code for a write
operation is described line-by-line in Section 4.2.3. Pseudo-code for a read operation is described
line-by-line in Section 4.2.3. The pseudo-code favors simplicity of presentation over optimizations
that may be found in an actual implementation.


Notation 

The protocol relies on concurrent requests that are described in the pseudo-code by remote
procedure calls and coroutines. The cobegin and parallel bars represent the forking of parallel threads
of execution. Such threads stop at end cobegin. The main thread continues to execute after
forking threads; that is, the main thread does not wait to join forked threads at end cobegin. Threads
are not preempted until they invoke a remote procedure call or wait on a semaphore. Semaphores
are binary and default to zero. A WAIT operation waits on a semaphore, and SIGNAL releases all
waiting threads. A return statement halts all threads and returns a value.

Each  operation  is  assigned  a  logical  timestamp,  represented  by  the  pair  (ts,fpcc).  Timestamps 
are  ordered  according  to  the  value  of  the  integer  ts  or,  if  they  share  the  same  ts,  by  comparison 
of  the  binary  value  fpcc.  Timestamps  that  share  the  same  ts  and  fpcc  are  equal.  The  most  recent 
commit  at  a  server  is  represented  as  latest_commit;  this  value  is  initialized  to  (0,null)  before  the 
protocol  starts.  Each  server  stages  fragments  for  concurrent  writes  and  stores  committed  fragments; 
both  staged  and  committed  fragments  are  kept  in  the  store  table. 
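
The ordering of logical timestamps can be sketched in Python, where tuples already compare element by element. The sketch assumes that fpcc is available as a fixed-length byte string and uses an empty byte string as a stand-in for the null fpcc of the initial timestamp; both choices are assumptions made for the example.

```python
NULL_FPCC = b""  # stand-in for the null fpcc of the initial timestamp (assumption)

def newer(a, b):
    """True if logical timestamp a = (ts, fpcc) is more recent than b."""
    return a > b  # tuples compare by ts first, then by the fpcc bytes

latest_commit = (0, NULL_FPCC)  # value before any write commits
t1 = (1, b"\x0a\xff")
t2 = (1, b"\x0b\x00")           # same ts as t1, ordered by fpcc
t3 = (2, b"\x00\x00")           # larger ts wins regardless of fpcc

ordered = sorted([t3, t1, latest_commit, t2])
```

Two concurrent writes that obtain the same ts are therefore still totally ordered, provided that their fingerprinted cross-checksums differ.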

A block is represented as B and a fragment as d in the pseudo-code. A cross-checksum,
abbreviated to cc in the pseudo-code, contains n hashes, cc[1], ..., cc[n]. The fingerprinted
cross-checksum, abbreviated to fpcc, contains m fingerprints, fpcc.fp[1], ..., fpcc.fp[m], and m + f
hashes, fpcc.cc[1], ..., fpcc.cc[m+f].

The encode_i(B) function encodes block B into its i-th erasure-coded fragment. The decode(...)
function will decode the first m fragments provided in its arguments, and the index of each
fragment is passed implicitly. Abbreviate encode_j(decode(d_{i_1}, ..., d_{i_m})) to
encode_j(d_{i_1}, ..., d_{i_m}). The fingerprint(h, d) function fingerprints fragment d given random
value h. The random value in the protocol is provided by a hash of the cross-checksum, which is
secure so long as the hash function acts as a random oracle [12]. The homomorphism of the
fingerprints provides the following property (Theorem 2.1.11): if d_1, ..., d_n ← encode_{1,...,n}(B)
and fp_i ← fingerprint(h, d_i), then fp_{i_0} = encode_{i_0}(fp_{i_1}, ..., fp_{i_m}) for any set of
indices i_0, ..., i_m.
c_write(B):  /* Wrapper function for c_dowrite */
1000:  d_1, ..., d_{m+f} ← encode_{1,...,m+f}(B)
1001:  for (i ∈ {1, ..., m+f}) do fpcc.cc[i] ← hash(d_i)
1002:  for (i ∈ {1, ..., m}) do fpcc.fp[i] ← fingerprint(hash(fpcc.cc), d_i)
1003:  c_dowrite(B, NULL, fpcc)

c_dowrite(B, ts, fpcc):  /* Do a write */
1100:  Prepare[*] ← NULL
1101:  cobegin
1102:  ||_{i ∈ {1, ..., n}}  /* Start worker threads */
1103:      /* Send fragment and fpcc to server, get MAC of fpcc and latest ts */
1104:      d_i ← encode_i(B)
1105:      if (fpcc.cc[i] = hash(d_i)) then
1106:          Prepare[i] ← S_i.s_rpc_prepare_frag(d_i, ts, fpcc)
1107:      else Prepare[i] ← S_i.s_rpc_prepare_block(B, ts, fpcc)  /* Bad fpcc.cc[i] */
1108:
1109:      /* Choose latest ts, if needed */
1110:      if (ts = NULL ∧ |{j : Prepare[j] ≠ NULL}| = 2f+1) then
1111:          ts ← max{ts' : (ts', *) ∈ Prepare}
1112:          SIGNAL(found_largest_ts)
1113:      if (ts = NULL) then WAIT(found_largest_ts)  /* Wait for chosen ts */
1114:      if (Prepare[i].ts ≠ ts) then  /* Need new prepare for chosen ts */
1115:          if (fpcc.cc[i] = hash(d_i)) then
1116:              Prepare[i] ← S_i.s_rpc_prepare_frag(d_i, ts, fpcc)
1117:          else Prepare[i] ← S_i.s_rpc_prepare_block(B, ts, fpcc)
1118:
1119:      /* Attempt commit */
1120:      if (|{j : Prepare[j].ts = ts}| ≥ m+f) then SIGNAL(prepare_ready)
1121:  end cobegin
1122:
1123:  UnwrittenSet ← {1, ..., n}
1124:  while (true) do
1125:      WAIT(prepare_ready)
1126:      cobegin
1127:      ||_{i ∈ UnwrittenSet}  /* Start worker threads */
1128:          if (SUCCESS = S_i.s_rpc_commit(ts, fpcc, Prepare)) then
1129:              UnwrittenSet ← UnwrittenSet \ {i}
1130:              if (|UnwrittenSet| ≤ f) then return SUCCESS
1131:      end cobegin

S_i.s_rpc_prepare_frag(d, ts, fpcc):  /* Grant permission to write fpcc at ts */
1200:  fp ← fingerprint(hash(fpcc.cc), d); h ← hash(d)
1201:  fp' ← encode_i(fpcc.fp[1], ..., fpcc.fp[m])
1202:  if (fp = fp' ∧ h = fpcc.cc[i]) then  /* Fragment is consistent with fpcc */
1203:      return S_i.s_prepare_common(ts, fpcc, d, NULL)
1204:  else return FAILURE  /* Faulty client */

S_i.s_rpc_prepare_block(B, ts, fpcc):  /* Grant permission to write fpcc at ts */
1300:  /* Called when partial encoding fails or when writing back fragments */
1301:  cnt ← 0
1302:  for (j ∈ {1, ..., m+f}) do  /* Validate block */
1303:      d_j ← encode_j(B)
1304:      fp ← fingerprint(hash(fpcc.cc), d_j); cc[j] ← hash(d_j)
1305:      fp' ← encode_j(fpcc.fp[1], ..., fpcc.fp[m])
1306:      if (fp = fp' ∧ cc[j] = fpcc.cc[j]) then cnt ← cnt + 1
1307:  if (cnt ≥ m) then  /* Found m fragments consistent with fpcc */
1308:      for (j ∈ {m+f+1, ..., n}) do cc[j] ← hash(encode_j(B))
1309:      return S_i.s_prepare_common(ts, fpcc, encode_i(B), cc)
1310:  else return FAILURE  /* Faulty client */

S_i.s_prepare_common(ts, fpcc, d, cc):  /* Create the prepare response */
1400:  if (ts = NULL) then ts ← latest_commit.ts + 1
1401:  nonce ← MAC_{i,i}((ts, fpcc))
1402:  if ((ts, fpcc) > latest_commit) then
1403:      nonce_hash ← hash(nonce)
1404:      store[(ts, fpcc)] ← (d, cc, nonce_hash, NULL)
1405:  return (ts, nonce, ⟨MAC_{i,j}((ts, fpcc, nonce))⟩_{1≤j≤n})

S_i.s_rpc_commit(ts, fpcc, Prepare):  /* Commit write of fpcc at ts */
1500:  if ((ts, fpcc) < latest_commit) then return SUCCESS  /* Overwritten */
1501:  Nonces ← {(j, nonce) : Prepare[j] = (ts, nonce, ⟨tag_k⟩_{1≤k≤n}) ∧
1502:            tag_i = MAC_{j,i}((ts, fpcc, nonce))}
1503:  if (|Nonces| ≥ m+f) then
1504:      (d, cc, nonce_hash, *) ← store[(ts, fpcc)]
1505:      store[(ts, fpcc)] ← (d, cc, nonce_hash, Nonces)
1506:      for ((ts', fpcc') < (ts, fpcc)) do store[(ts', fpcc')] ← NULL
1507:      latest_commit ← (ts, fpcc)
1508:      return SUCCESS
1509:  else return FAILURE

Figure 4.2.2: Detailed write pseudo-code.


Write 

Pseudo-code for write is provided in Figure 4.2.2. A write operation is invoked with a block of
data as its argument and returns SUCCESS as its response. Write is divided into prepare and
commit phases. The pseudo-code breaks write into a wrapper function, c_write, and a main function,
c_dowrite, such that the main function can be reused for writing back fragments during a read
operation. The wrapper function encodes the block into m + f fragments and computes a fingerprinted
cross-checksum (lines 1000-1002). It then calls c_dowrite (line 1003).

Prepare, lines 1100-1121: The prepare phase is described in the top half of c_dowrite. The
client invokes s_rpc_prepare_frag at each of the first m + f servers with its fragment and the
fingerprinted cross-checksum (line 1106); ts will be NULL. A correct server S_i verifies that this
fragment is consistent with the fingerprinted cross-checksum. To do so, it first computes the
fingerprint and the hash of the fragment (line 1200). It then computes the erasure coding of the
homomorphic fingerprints in the fingerprinted cross-checksum (line 1201). Finally, it ensures that this
erasure coding is equal to the fingerprint of this fragment, and that the hash is equal to the hash in
the cross-checksum (line 1202). If the fragment is consistent, the server prepares a response in
s_prepare_common (line 1203).

Because of partial encoding, there are only m + f erasure-coded fragments, so if one of
the first m + f servers is not responsive or if commit fails, the client may need to invoke
s_rpc_prepare_block with the entire block (line 1107). A correct server S_i verifies that the
erasure coding of the block contains at least m fragments that are consistent with the fingerprinted
cross-checksum. To do so, it encodes the block into each of m + f fragments (line 1303), computes
the fingerprint and the hash of each fragment (line 1304), and computes the appropriate erasure
coding of the homomorphic fingerprints in the fingerprinted cross-checksum (line 1305). It counts
the number of fragments for which the fingerprint is equal to the erasure coding of the
homomorphic fingerprints and the hash is equal to the appropriate hash in the fingerprinted
cross-checksum (line 1306). If there are at least m such consistent fragments, server S_i computes the
fragment d_i and the rest of the cross-checksum of all n fragments and prepares a response in
s_prepare_common (lines 1307-1309).

Invoking s_rpc_prepare_block allows a client to write a fragment that is not consistent with
the fingerprinted cross-checksum, so long as this fragment can be erasure-coded from a block with
m erasure-coded fragments that are consistent with the fingerprinted cross-checksum. This will be
useful in the read protocol to ensure that any fragment can be written back as needed.

The response prepared in s_prepare_common consists of a ts, a nonce, and n MACs (one for
each server). The ts may be provided by the client; if not, it is assigned to one greater than the
ts portion of the logical timestamp used in the most recent commit (line 1400). The nonce is a
pseudo-random value that is unique for each timestamp (line 1401); a MAC of the timestamp can be
used to ensure this property. An array of n MACs is computed with the shared pairwise MAC keys
(line 1405). The MACs are used in the commit phase to authenticate the timestamp and nonce, as
well as to prove that enough correct servers stored consistent fragments.

If this timestamp is more recent than the most recently committed timestamp (line 1402), the
nonce_hash, a preimage-resistant hash of the nonce, is computed (line 1403), and the fragment
and nonce_hash are stored for future reads (line 1404). If s_prepare_common was called by
s_rpc_prepare_block, the correct cross-checksum of all n fragments is also stored. (The NULL value
on line 1404 is a placeholder that will be filled in s_rpc_commit.) The nonces and nonce_hashes
are used to prove that a client invoked a write at this timestamp. Other protocols ensure this
property in ways that would require more communication [18], public-key signatures [66], or 4f + 1
servers [44]; nonces are used here to avoid these mechanisms.

The client must wait for 2f + 1 responses before assigning a timestamp to this write. (The first
2f threads wait on line 1113 until a timestamp is assigned.) The timestamp is the pair (ts, fpcc),
where ts is the greatest ts value from the 2f + 1 responses. If a server provides a response with a
different timestamp, the client must retry that request (lines 1114-1117).
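
The timestamp choice can be sketched in Python as follows. The dictionary of responses and the function names are assumptions made for the illustration. In the pseudo-code the choice is made at the moment exactly 2f + 1 responses have arrived; the sketch assumes it is called at that point.

```python
def choose_timestamp(prepare, f):
    """Return the chosen ts once 2f + 1 prepare responses are in, else None.

    `prepare` maps a server index to the ts carried in that server's response."""
    if len(prepare) < 2 * f + 1:
        return None  # keep waiting, as the first 2f threads do on line 1113
    return max(prepare.values())

def servers_to_retry(prepare, ts):
    """Servers whose response carries a different ts and must be asked again."""
    return sorted(i for i, server_ts in prepare.items() if server_ts != ts)

f = 1
responses = {1: 5, 2: 5, 3: 6}            # 2f + 1 = 3 responses
ts = choose_timestamp(responses, f)
retry = servers_to_retry(responses, ts)
too_few = choose_timestamp({1: 5}, f)
```

When no write is concurrent and no server is faulty, all responses carry the same ts, the retry list is empty, and the write proceeds directly to commit.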

Commit, lines 1123-1131: After m + f servers have provided MACs in responses with matching
timestamps, commit may be attempted (line 1120). The commit may fail if a faulty server provided
one of these MACs or rejects a MAC from a correct server. But, eventually, at least m + f correct
servers will return responses with MACs that will be accepted by at least m + f servers in commit.
As prepare responses arrive, they are forwarded to all servers (line 1128). Thus, the threads from
the prepare phase do not stop until all servers return responses or the commit phase completes. The
commit phase completes and the client can return once m + f servers (all but f) return SUCCESS
(line 1130).

If a write has a lower timestamp than a previously committed write, a server can ignore it
(line 1500). A correct server aggregates nonces from valid prepare responses (lines 1501-1502).
A prepare response is valid if the MAC included for this server is a MAC of the timestamp and
nonce computed with the proper pairwise key. If there are at least m + f nonces from valid prepare
requests, then at least m correct servers stored a fragment, so commit will succeed. The NULL
value from line 1404 is filled in with these nonces (lines 1504-1505). This will become the new
most recent write (line 1507). If a client tries to read a fragment with a lower timestamp, it can be
redirected to this write, so earlier fragments can be garbage collected (line 1506). Hence, a server
must stage fragments for concurrent writes but store only the most recently committed fragments.

c_read():  /* Read a block */
1600:  Timestamp[*] ← NULL; State[*] ← ∅
1601:
1602:  /* Search for write timestamps */
1603:  cobegin
1604:  ||_{i ∈ {1, ..., 3f+1}}  /* Start worker threads */
1605:      Timestamp[i] ← S_i.s_rpc_find_timestamp()
1606:      SIGNAL(found_timestamp)
1607:  end cobegin
1608:
1609:  while (true) do
1610:      WAIT(found_timestamp)
1611:      /* Try any timestamp greater or equal to 2f+1 timestamps */
1612:      for ((ts, fpcc) : (ts, fpcc) = Timestamp[i] ∧ (ts, fpcc) ∉ Tried ∧
1613:           |{j : (ts, fpcc) ≥ Timestamp[j]}| ≥ 2f+1) do
1614:          Tried ← Tried ∪ {(ts, fpcc)}
1615:          if (ts = 0) then return NULL  /* No writes yet */
1616:
1617:          cobegin
1618:          ||_{i ∈ {1, ..., n}}  /* Start worker threads */
1619:              (ts, fpcc) ← (ts, fpcc)  /* Thread local copy of variables */
1620:              (data, gc_redirect) ← S_i.s_rpc_read(ts, fpcc)
1621:              B ← c_find_block(i, State, ts, fpcc, data)
1622:
1623:              /* Write back fragments as needed and return the block */
1624:              if (B ≠ NULL) then
1625:                  c_dowrite(B, ts, fpcc)
1626:                  return B
1627:
1628:              /* Follow garbage collection redirection */
1629:              if (gc_redirect ≠ NULL ∧ gc_redirect > (ts, fpcc)) then
1630:                  Timestamp[i] ← gc_redirect
1631:                  SIGNAL(found_timestamp)
1632:          end cobegin

S_i.s_rpc_find_timestamp():  /* Return the latest commit */
1700:  return latest_commit

c_find_block(i, State, ts, fpcc, (d, cc, nonce_hash, Nonces)):  /* Classify read */
1800:  if (d ≠ NULL ∧ cc = NULL) then  /* Verify fragment-encoded arguments */
1801:      fp ← fingerprint(hash(fpcc.cc), d)
1802:      fp' ← encode_i(fpcc.fp[1], ..., fpcc.fp[m])
1803:      if (fp ≠ fp' ∨ hash(d) ≠ fpcc.cc[i]) then return NULL
1804:
1805:  if (d ≠ NULL ∧ cc ≠ NULL) then  /* Verify block-encoded arguments */
1806:      if (hash(d) ≠ cc[i]) then return NULL
1807:
1808:  /* Update state and count preimages */
1809:  State[(ts, fpcc)] ← State[(ts, fpcc)] ∪ {(i, d, cc, nonce_hash, Nonces)}
1810:  npreimages ← |{j : (j, *, *, nonce_hash', *) ∈ State[(ts, fpcc)] ∧
1811:                 (*, *, *, *, {*, (j, nonce'), *}) ∈ State[(ts, fpcc)] ∧
1812:                 nonce_hash' = hash(nonce')}|
1813:  if (npreimages < f+1) then return NULL
1814:
1815:  /* Try to decode */
1816:  Frags ← {d' ≠ NULL : (*, d', NULL, *, *) ∈ State[(ts, fpcc)]}
1817:  if (|Frags| ≥ m) then return decode(Frags)
1818:  else for (cc' ≠ NULL : (*, *, cc', *, *) ∈ State[(ts, fpcc)]) do
1819:      Frags' ← {d' ≠ NULL : (*, d', cc', *, *) ∈ State[(ts, fpcc)]}
1820:      if (|Frags ∪ Frags'| ≥ m) then
1821:          cnt ← 0; B ← decode(Frags ∪ Frags')
1822:          for (j ∈ {1, ..., m+f}) do  /* Validate block */
1823:              d_j ← encode_j(B)
1824:              fp ← fingerprint(hash(fpcc.cc), d_j); h ← hash(d_j)
1825:              fp' ← encode_j(fpcc.fp[1], ..., fpcc.fp[m])
1826:              if (fp = fp' ∧ h = fpcc.cc[j]) then cnt ← cnt + 1
1827:          if (cnt ≥ m) then  /* Found m fragments consistent with fpcc */
1828:              return B
1829:
1830:  return NULL  /* No block found */

S_i.s_rpc_read(ts, fpcc):  /* Read the fragment at (ts, fpcc) */
1900:  if (store[(ts, fpcc)] = NULL ∧ latest_commit > (ts, fpcc)) then
1901:      return ((NULL, NULL, NULL, NULL), latest_commit)
1902:  else return (store[(ts, fpcc)], NULL)

Figure 4.2.3: Detailed read pseudo-code.


Read 

Pseudo-code for a read operation is provided in Figure 4.2.3. A read operation is invoked with no
arguments and returns a block as its response. A read operation is divided into two phases, "find
timestamps" and "read timestamp." The client searches for the timestamps of the most recently
committed write at each server. As timestamps arrive, the client tries reading at any timestamp
greater than or equal to 2f + 1 other timestamps.

Find timestamps, lines 1600-1615: The client queries each of the first 3f + 1 servers for the
timestamp of its most recently committed write (lines 1602-1607). As timestamps arrive, the client
tries reading at any timestamp that it has yet to try and that is greater than or equal to 2f + 1
other timestamps (lines 1610-1614). If no writes have been committed, the value latest_commit
(line 1700) defaults to (0, NULL); if 2f + 1 or more servers return (0, NULL), no writes have returned
yet, so a NULL block is returned (line 1615).
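
The selection rule on lines 1612-1613 can be sketched in Python. The list-based representation of Timestamp, the use of None for servers that have not answered, and the function name are assumptions made for the example. As in the pseudo-code, the count includes the server that reported the candidate itself.

```python
def candidate_timestamps(timestamps, tried, f):
    """Untried timestamps that are >= at least 2f + 1 of the timestamps received so far."""
    received = [t for t in timestamps if t is not None]
    out = []
    for t in sorted(set(received)):
        dominated = sum(1 for other in received if t >= other)
        if t not in tried and dominated >= 2 * f + 1:
            out.append(t)
    return out

f = 1
# Three of 3f + 1 = 4 servers have answered; the fourth is still pending.
timestamps = [(3, b"a"), (3, b"a"), (4, b"b"), None]
first = candidate_timestamps(timestamps, set(), f)
again = candidate_timestamps(timestamps, {(4, b"b")}, f)
empty_system = candidate_timestamps([(0, b"")] * 3, set(), f)
```

In the last call every server reports the initial timestamp, so the only candidate has ts = 0 and the read would return a NULL block.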

Read timestamp, lines 1617-1632: To read a fragment, the client invokes s_rpc_read at each
server (line 1620). A correct server returns the fragment along with the other data stored during
write (line 1902). The client processes each response from s_rpc_read with the helper function
c_find_block (line 1621). This function verifies that the fragment is valid (lines 1800-1806),
determines whether a client invoked this write (lines 1810-1813), and decodes a block if possible
(lines 1815-1828). If a correct server has no record of this fragment but knows of a more recent
write, it returns the timestamp of the more recent write (line 1901). The client follows such garbage
collection redirections if a block is not found (lines 1628-1631).

If a correct server received a fragment in a successful call to s_rpc_prepare_frag, the value cc
will be set to NULL and the client will verify that the fragment is consistent with the fingerprinted
cross-checksum (lines 1801-1803). If a correct server received a fragment in a successful call to
s_rpc_prepare_block, the value cc will be the cross-checksum of all n fragments. The client verifies
that the cross-checksum cc matches at least this fragment (line 1806). If either verification fails, the
response is ignored (lines 1803 and 1806). Otherwise, the client records this fragment in the state
for this timestamp (line 1809) and tries to determine whether a client invoked this write (lines
1810-1813). If there are at least f + 1 nonces that are the preimages of nonce_hashes, one was generated
by a correct server, which implies that a client invoked a write with this timestamp (i.e., the write
was not fabricated by faulty servers) and so it is eligible to be examined further. Otherwise, the
client waits for more nonces before trying to decode a block.
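
The nonce test described above amounts to counting hash preimages. The following is a minimal sketch under stated assumptions: the function names and the per-server dictionaries are hypothetical, and SHA-1 is used only because the prototype described in Section 4.3 uses it.

```python
import hashlib
from typing import Dict


def count_preimages(nonce_hashes: Dict[int, bytes], nonces: Dict[int, bytes]) -> int:
    """Count servers j whose revealed nonce hashes to the nonce_hash stored for j.

    nonce_hashes maps server index -> nonce_hash returned with that server's fragment.
    nonces maps server index -> nonce revealed in some server's Nonces set.
    """
    return sum(
        1
        for j, nonce in nonces.items()
        if j in nonce_hashes and hashlib.sha1(nonce).digest() == nonce_hashes[j]
    )


def write_was_invoked(nonce_hashes: Dict[int, bytes], nonces: Dict[int, bytes], f: int) -> bool:
    # With at least f + 1 matching preimages, at least one came from a correct server.
    return count_preimages(nonce_hashes, nonces) >= f + 1
```

With f = 1, two matching preimages suffice; a single match could have been fabricated by the one faulty server, so the client keeps waiting.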

If  enough  nonces  are  found,  the  client  tries  to  reconstruct  the  block.  A  block  can  always  be 
decoded  given  m  fragments  consistent  with  the  fingerprinted  cross-checksum  (line  1817).  If  any 
fragments  were  provided  with  an  additional  cross-checksum  value  cc,  the  client  can  reconstruct 
a  block  and  check  if  the  erasure-coding  of  that  block  includes  m  fragments  consistent  with  the 
fingerprinted  cross-checksum.  Since  all  correct  servers  will  produce  the  same  cross-checksum  in 
s_rpc_prepare_block, it suffices to check each value of cc in turn (line 1818). If m fragments were
returned  with  the  same  value  cc  or  are  consistent  with  the  fingerprinted  cross-checksum,  the  client 
decodes  a  block  (line  1821).  If  at  least  m  fragments  in  the  erasure-coding  of  this  block  are  consistent 
with  the  fingerprinted  cross-checksum  (lines  1822-1827),  this  block  will  be  returned.  Note  that  the 
check in c_find_block (lines 1822-1827) is identical to that in s_rpc_prepare_block (lines
1302-1307).

If  a  block  is  found,  it  is  returned  as  the  response  of  the  read  (line  1626).  To  ensure  that  this  block 
is  seen  by  subsequent  reads,  the  client  writes  back  fragments  as  needed  (line  1625).  In  practice,  the 
client can skip write back if any 2f + 1 of the first 3f + 1 servers claim to have committed this
timestamp  or  a  more  recent  one. 
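
The write-back shortcut can be expressed as a simple count. This is a sketch only; the function name and the list-of-tuples representation of the servers' reported commits are assumptions.

```python
from typing import List, Tuple


def can_skip_write_back(reported: List[Tuple[int, bytes]],
                        read_ts: Tuple[int, bytes], f: int) -> bool:
    """reported holds the latest committed (ts, fpcc) claimed by each of the first 3f + 1 servers."""
    up_to_date = sum(1 for t in reported if t >= read_ts)
    return up_to_date >= 2 * f + 1
```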

4.2.4  Correctness 

This  section  provides  arguments  for  the  safety  and  liveness  properties  of  the  protocol. 

Liveness 

This  section  argues  the  liveness  properties  of  write  and  read  operations.  Two  notions  of  liveness 
are  considered,  namely  wait  freedom  [49]  and  obstruction  freedom  [50].  Informally,  an  operation 
is wait-free if the invoking client can drive the operation to completion in a finite number of steps,
irrespective of the behavior of other clients. An operation is obstruction-free if the invoking client
can  drive  the  operation  to  completion  in  a  finite  number  of  steps  once  all  other  clients  are  inactive  for 
sufficiently  long.  That  is,  an  obstruction-free  operation  may  not  complete,  but  only  due  to  continual 
interference  by  other  clients. 

Theorem  4.2.1.  Write  operations  are  wait-free. 

Proof. c_dowrite invokes s_rpc_prepare_frag (line 1106) or s_rpc_prepare_block (line 1107) at
each server. Each such call at a correct server returns successfully (line 1203 or 1309), implying
that the client receives at least n − f ≥ 2f + 1 responses. Consequently, if ts as input to c_dowrite
is NULL then the largest ts returned by servers is chosen (line 1111) and found_largest_ts is signaled
(line 1112). An s_rpc_prepare_frag (line 1116) or s_rpc_prepare_block (line 1117) call is then
placed at each server that did not return this ts. If ts as input to c_dowrite is not NULL, all correct
servers will return this ts. By the time the last of the threads that will reach line 1120 does so (if
not sooner), all correct servers have contributed a response for the same timestamp to Prepare from
s_rpc_prepare_frag or s_rpc_prepare_block, causing prepare_ready to be signaled. The collected
set of prepare responses in Prepare is then sent to all servers in an s_rpc_commit (line 1128).
Because at least m + f of the prepare responses in Prepare are from correct servers, at least m + f
of the prepare responses contain correct MAC values (line 1501) and so Nonces will include at
least m + f tuples (line 1503). Hence, these s_rpc_commit calls to correct servers return SUCCESS
(line 1508), and so the write operation completes (line 1130). □

Definition 4.2.2. If S_i.s_rpc_commit(ts, fpcc, Prepare) returns SUCCESS, then this commit at S_i
is said to rely on S_j if Prepare[j] = (ts, nonce, (tag_k)_{1≤k≤n}) and tag_i = MAC_{j,i}((ts, fpcc, nonce)).

Lemma 4.2.3. In a correct client's c_read, suppose that for a fixed (ts, fpcc) the following occurs:
From some correct S_i that previously returned SUCCESS to s_rpc_commit(ts, fpcc, *), and from
each of m correct servers on which the first such commit at S_i relies, the client receives (data, *)
in response to an s_rpc_read(ts, fpcc) call on that server (line 1620) where data ≠ (NULL, NULL,
NULL, NULL). Then, the call to c_find_block (line 1621) including the last such response returns a
block B ≠ NULL.

Proof. Consider such a data = (d, cc, nonce_hash, Nonces) received from a correct server S_j. d
either is consistent with fpcc as verified by S_j in s_rpc_prepare_frag (lines 1200-1202) and verified
by the client in c_find_block (lines 1800-1803), or cc matches this fragment, as generated by S_j in
s_rpc_prepare_block (lines 1304 and 1308) and verified by the client in c_find_block (line 1806).
In the latter case, the fact that S_j reached line 1404 (where it saved (d, cc, nonce_hash, *)) implies
that previously cnt ≥ m in line 1307, and so by the properties of fingerprinted cross-checksums
(Theorem 2.2.4), all such servers received the same input block B in calls s_rpc_prepare_block(B,
ts, fpcc) and so constructed the same cc in s_rpc_prepare_block.

Now consider the response data = (d, cc, nonce_hash, Nonces) from the correct server S_i. Recall
that S_i previously returned SUCCESS to s_rpc_commit(ts, fpcc, *), and that the first such commit
relied on the m correct servers S_j. Since the focus is on the first such commit at S_i, and since data
≠ (NULL, NULL, NULL, NULL), the success response was generated in line 1508, not 1500. In
this case, Nonces ≠ NULL by lines 1503 and 1505, and in fact includes (j, nonce) pairs for the m
correct servers S_j on which this commit relies. When the last data from S_i and these m servers S_j is
passed to c_find_block (line 1621), each nonce_hash present in each S_j's data will have a matching
nonce in S_i's Nonces, i.e., such that nonce_hash = hash(nonce). Hence, npreimages ≥ m ≥ f + 1
(lines 1810-1813), and so the client will try to decode a block in lines 1815-1828.

If the responses from the m servers S_j on which the commit at S_i relies have cc = NULL, a block
is decoded and returned (line 1817). Otherwise, the client eventually tries to decode these fragments
accompanied by cc = NULL (the set Frags) together with those accompanied by cc ≠ NULL provided
by these correct servers S_j (the set Frags'); see line 1821. Each S_j contributing a fragment of the
latter type verified that fpcc was consistent with m fragments derived from the block B input to
s_rpc_prepare_block (lines 1302-1307), and generated its fragment to be a valid fragment of B.
Each fragment of the former type was verified by S_j to be consistent with fpcc (lines 1200-1202),
and so is a valid fragment of B (with overwhelming probability by Corollary 2.1.12). Consequently,
upon decoding any m of these fragments, the client obtains B, will find m fragments of the resulting
block to be consistent with fpcc (lines 1822-1827), and so will return B (line 1828). □

Theorem 4.2.4. The read protocol is obstruction-free.

Proof. In c_read, a call to s_rpc_find_timestamp is made to servers 1, ..., 3f + 1 (line 1605), to
which each of at least 2f + 1 correct servers responds with its value of latest_commit (line 1700).
Consider the greatest timestamp (ts, fpcc) returned by a correct server, say S_i. This timestamp
is greater than or equal to the timestamp from at least the 2f + 1 correct servers that responded
(checked in lines 1612-1613), so the client tries to read fragments via s_rpc_read at this
timestamp (line 1620). This timestamp was previously committed by S_i, as a correct server updates
latest_commit only in s_rpc_commit at line 1507. Moreover, this commit relies on at least m
correct servers S_j; see line 1503. Now consider the following two possibilities for each of these correct
servers S_j on which the commit relies:

•  S_j assigned to store[(ts, fpcc)] in line 1404 because the condition in line 1402 evaluated to
true, and has not subsequently deleted store[(ts, fpcc)] in line 1506. In this case, S_j returns
the contents of store[(ts, fpcc)] in response to the s_rpc_read call (line 1902).

•  S_j either did not assign to store[(ts, fpcc)] in line 1404 because the condition in line 1402
evaluated to false, or deleted store[(ts, fpcc)] in line 1506. In this case, store[(ts, fpcc)] =
NULL and latest_commit > (ts, fpcc) in line 1900 (due to lines 1402 and 1507), and so S_j
returns latest_commit in response to the s_rpc_read call (line 1901).

If all m correct servers S_j fall into the first case above, then one of the client's calls to
c_find_block (line 1621) returns a non-NULL block (Lemma 4.2.3). The client writes this with a
c_dowrite call (line 1625), which is wait-free (Theorem 4.2.1), and then completes the c_read. If
some S_j falls into the second case above, then the timestamp it returns (or a higher one returned by
another correct server) satisfies the condition in lines 1612-1613 and so the client will subsequently
read at this timestamp (line 1620) if a non-NULL block is not first returned from c_find_block
(line 1621).

Consequently, for the client to never return a block in a c_read, correct servers must continuously
return increasing timestamps in response to s_rpc_read calls. If there are no concurrent
commits, then the client must reach a timestamp at which it returns a block. Hence, the read
protocol is obstruction-free. □


Linearizability 

Informally,  linearizability  [51]  requires  that  the  responses  to  read  operations  are  consistent  with 
an  execution  of  all  reads  and  writes  in  which  each  operation  is  performed  at  a  distinct  moment 
in  real  time  between  when  it  is  invoked  and  when  it  completes.  Only  reads  by  correct  clients 
need  be  considered,  because  no  guarantees  are  provided  to  faulty  clients.  Since  writes  by  faulty 
clients  can  be  read  by  correct  clients,  however,  such  writes  cannot  be  ignored.  Consequently,  the 
execution of s_rpc_prepare_frag or s_rpc_prepare_block by a faulty client at a correct server that
returns (ts, nonce, *) (i.e., returns a value on line 1405 rather than returning FAILURE on line 1204
or 1310) is defined to instantiate a write invocation at the beginning of time. The timestamp of
the invocation is the pair (ts, fpcc) used to generate a nonce (line 1401). Each operation by a correct
client also gets an associated timestamp (ts, fpcc). For a write operation, the timestamp is that sent
to s_rpc_commit. For a read operation, the timestamp is that of the write operation from which it
read.

Proving  that  faulty  write  operations  and  correct  write  and  read  operations  are  linearizable  shows 
that  the  protocol  guarantees  a  natural  extension  to  linearizability,  limiting  faulty  clients  to  invoking 
writes  that  they  could  have  invoked  anyway  at  similar  expense  had  they  followed  the  protocol.  The 
following  five  lemmas  are  used  to  prove  that  such  a  history  is  linearizable. 

Lemma 4.2.5. A read will share a timestamp with a write that has been invoked by some client.


Proof. Per line 1813, a call to c_find_block(*, State, ts, fpcc, *) returns a non-NULL value only
if State[(ts, fpcc)], possibly modified per line 1809, includes a nonce_hash from some correct S_j
such that some S_i.s_rpc_read(ts, fpcc) returned data = (*, *, *, Nonces), Nonces ∋ (j, nonce) and
hash(nonce) = nonce_hash (data was passed to this or a previous c_find_block(*, State, ts, fpcc,
*) call, see lines 1620-1621, and then added to State[(ts, fpcc)] in line 1809). This nonce was
created by S_j on line 1401 with (ts, fpcc). Since a correct writer keeps each nonce secret unless it is
returned from s_rpc_prepare_frag or s_rpc_prepare_block for the timestamp on which it settles for
its write timestamp, this nonce shows that the writer, if correct, adopted (ts, fpcc) as its timestamp;
consequently, the write with this timestamp was invoked, satisfying the lemma. If no correct writer
performed a write with timestamp (ts, fpcc), then the creation of nonce by S_j in line 1401 with (ts,
fpcc) implies that the write with timestamp (ts, fpcc) was invoked by a faulty client. □

Lemma 4.2.6. Consider two invocations

B ← c_find_block(*, *, ts, fpcc, *)

B' ← c_find_block(*, *, ts, fpcc, *)

at correct clients for the same timestamp (ts, fpcc). If B ≠ NULL and B' ≠ NULL, then B = B' with
all but negligible probability.

Proof. A block B ≠ NULL is returned by c_find_block(*, ts, fpcc, *) at either line 1817 or
line 1828. B is returned at line 1828 only if at least m erasure-coded fragments produced from B
are consistent with fpcc, as checked in lines 1822-1827. Similarly, B is returned at line 1817 only
after it is reconstructed from at least m fragments d' such that (*, d', cc, *, *) ∈ State[(ts, fpcc)]
and cc = NULL; each such d' was confirmed to be consistent with fpcc in lines 1800-1803, in either
this or an earlier invocation of the form c_find_block(*, *, ts, fpcc, *). In either case, B has at least
m erasure-coded fragments consistent with fpcc. If blocks B and B' each have at least m
erasure-coded fragments that are consistent with the same fpcc, they are the same with all but negligible
probability (Theorem 2.2.4). □

Lemma  4.2.6  states  that  two  correct  clients  who  read  blocks  at  the  same  timestamp  read  the  same 
block,  since  the  block  returned  from  c_read  is  that  produced  by  c_find_block  (lines  1621-1626). 

Lemmas 4.2.7-4.2.9 show that timestamp order for operations is consistent with real-time
precedence.

Lemma  4.2.7.  Consider  two  write  operations  performed  by  correct  clients.  If  the  response  to  one 
precedes  the  invocation  of  the  other,  then  the  timestamp  of  the  former  is  less  than  the  timestamp  of 
the  latter. 


Proof. Before the earlier write returns a response in line 1130, at least n − f servers returned
SUCCESS from s_rpc_commit(ts, fpcc, Prepare), where (ts, fpcc) is the timestamp of this write. In
doing so, at least m = n − 2f correct servers record latest_commit ← (ts, fpcc) (line 1507) if not
greater (line 1500). Consequently, the ts value returned in the prepare phase for the later write by
these servers will be greater than the ts value for the earlier write (line 1400). Because there are
n = m + 2f total servers, any 2f + 1 servers will include one of these m correct servers, so the ts
chosen (line 1111) will be greater than the ts value in the timestamp for the earlier write. □
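
The intersection claim at the end of this proof (any 2f + 1 servers include at least one of the m servers that recorded the commit, since (2f + 1) + m − n = 1) can be checked by brute force for small parameters. The sketch below is illustrative only; the function name is hypothetical.

```python
from itertools import combinations


def quorums_always_intersect(m: int, f: int, quorum_size: int) -> bool:
    """Check that every set of quorum_size servers overlaps every set of m servers, with n = m + 2f."""
    servers = range(m + 2 * f)
    return all(
        set(quorum) & set(committed)
        for committed in combinations(servers, m)
        for quorum in combinations(servers, quorum_size)
    )
```

For m = 3 and f = 2, quorums of size 2f + 1 = 5 always overlap the committed set, whereas quorums of size 2f = 4 can miss it entirely, which is why the protocol waits for 2f + 1 responses.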

Lemma  4.2.8.  Consider  a  write  operation  and  a  read  operation,  both  performed  by  correct  clients. 
If  the  response  to  the  write  precedes  the  invocation  of  the  read,  then  the  timestamp  of  the  write  is  at 
most  the  timestamp  of  the  read. 


Proof. Before the write returns a response in line 1130, at least n − 2f correct servers returned
SUCCESS from s_rpc_commit(ts, fpcc, Prepare), where (ts, fpcc) is the timestamp of this write. In
doing so, at least n − 2f correct servers record latest_commit ← (ts, fpcc) (line 1507) if not greater
(line 1500). These n − 2f = m correct servers include at least f + 1 of the servers 1, ..., 3f + 1,
and so in the read operation at most 2f of servers 1, ..., 3f + 1 respond to s_rpc_find_timestamp
at line 1700 with a lower timestamp than (ts, fpcc). Because a read considers only timestamps that
are at least as large as those returned by 2f + 1 of the first 3f + 1 servers (line 1613), the read will
be assigned a timestamp at least as large as (ts, fpcc). □
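
The counting step in this proof can be spelled out as follows, assuming n = m + 2f and writing C for the set of m correct servers that recorded the commit (the symbol C is introduced here only for illustration):

```latex
\[
  \bigl|\{1,\dots,3f+1\} \setminus C\bigr| \;\le\; n - m \;=\; 2f
  \quad\Longrightarrow\quad
  \bigl|\{1,\dots,3f+1\} \cap C\bigr| \;\ge\; (3f + 1) - 2f \;=\; f + 1 .
\]
```

Since at most 2f of the first 3f + 1 servers can report a lower timestamp, any timestamp that is at least as large as 2f + 1 of their responses must be at least as large as the response of some server in C.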

Lemma  4.2.9.  If  the  response  of  a  read  operation  by  a  correct  client  precedes  the  invocation  of 
another  (write  or  read)  operation  by  a  correct  client,  then  the  timestamp  of  the  former  operation  is 
at  most  the  timestamp  of  the  latter. 

Proof. A read calls c_dowrite at its timestamp before returning a response (line 1625). This will
have the same effect as completing a write at that timestamp. Consequently, the later operation will
have a higher timestamp if it is a write (Lemma 4.2.7) and a timestamp at least as high if it is a read
(Lemma 4.2.8). □

Theorem  4.2.10.  Write  and  read  operations  are  linearizable. 


Proof.  To  show  linearizability,  construct  a  linearization  (total  order)  of  all  read  operations  by  correct 
clients  and  all  write  operations  that  is  consistent  with  the  real-time  precedence  between  operations 
such  that  each  read  operation  returns  the  block  written  by  the  preceding  write  operation.  First,  order 
all  writes  in  increasing  order  of  their  timestamps.  By  Lemma  4.2.7,  this  ordering  does  not  violate 
real-time precedence. Next, place all read operations with the same timestamp immediately following
a write operation with that timestamp, ordered consistently with real-time precedence (i.e., each
read  is  placed  somewhere  after  all  other  operations  with  the  same  timestamp  that  completed  before 
it  was  invoked).  By  Lemma  4.2.5,  each  read  is  placed  after  some  write  operation.  This  placement 
does  not  violate  real-time  precedence  with  the  next  (or  any)  write  operation  in  the  linearization, 
since if the next write had completed before this read began, then by Lemma 4.2.8 this read operation
could not have the timestamp it does. Real-time precedence between reads with different
timestamps  cannot  be  violated  by  this  placement,  by  Lemma  4.2.9.  Since  all  reads  with  the  same 
timestamp  read  the  same  block  (Lemma  4.2.6) — which  is  the  block  written  in  the  write  with  that 
timestamp  if  the  writer  was  correct — write  and  read  operations  are  linearizable.  □ 
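
The construction in this proof can be sketched in code. This is an illustration under stated assumptions, not part of the thesis: the Op record, its field names, and the choice to order same-timestamp reads by invocation time (one ordering that respects real-time precedence) are hypothetical, and writes are assumed to carry distinct timestamps.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Op:
    kind: str                # "write" or "read"
    ts: Tuple[int, str]      # the (ts, fpcc) timestamp associated with the operation
    invoked: float           # real time of invocation
    completed: float         # real time of response


def linearize(ops: List[Op]) -> List[Op]:
    """Order writes by timestamp and place each read right after the write it read from."""
    order: List[Op] = []
    for write in sorted((o for o in ops if o.kind == "write"), key=lambda o: o.ts):
        order.append(write)
        reads = [o for o in ops if o.kind == "read" and o.ts == write.ts]
        order.extend(sorted(reads, key=lambda o: o.invoked))
    return order


def respects_real_time(order: List[Op]) -> bool:
    """No operation may be placed before one that completed before it was invoked."""
    return all(
        not (later.completed < earlier.invoked)
        for i, earlier in enumerate(order)
        for later in order[i + 1:]
    )
```

Lemmas 4.2.7 through 4.2.9 are what guarantee that respects_real_time holds for histories the protocol can actually produce; the check itself is only a sanity test on an example history.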


4.3  Implementation 

The protocol is evaluated and compared to competing approaches using a distributed storage
prototype. The low-overhead fault-tolerant prototype consists of a client library, linked to directly by
client applications, and a storage server application. The prototype supports the protocol described
in Section 4.2 as well as several competing protocols, as described in Section 4.4.1.

The  client  library  interface  consists  of  two  functions,  “read  block”  and  “write  block.”  In  addition 
to  the  parameters  described  in  Section  4.2,  read  and  write  accept  a  block  number  as  an  additional 
argument,  which  can  be  thought  of  as  running  an  instance  of  the  protocol  in  parallel  for  every 
block  in  the  system.  Each  server  in  a  pool  of  storage  servers  runs  the  storage  server  application, 
which  accepts  incoming  RPC  requests  and  executes  as  described  in  Section  4.2.  Clients  and  servers 
communicate  with  remote  procedure  calls  over  TCP  sockets.  Each  server  has  a  large  NVRAM 
cache, where non-volatility is provided by battery backup. This allows most writes and many reads
to  return  without  disk  I/O. 

The prototype uses 16-byte fingerprints generated with the evaluation homomorphic fingerprinting
function (Section 2.4). Fingerprinting is fast, and only the first m fragments are fingerprinted
(the other fingerprints can be computed from these fingerprints). Homomorphic fingerprinting
requires a small random value for each distinct block that is fingerprinted. This random value is
provided by a hash of the cross-checksum in the protocol. After computing this random value, the
implementation precomputes a 64 kB table, which takes about 20 microseconds. After computing
this table, fingerprinting each byte requires one table lookup and a 128-bit XOR. This implementation
can fingerprint about 410 megabytes per second on a 3 GHz Pentium D processor.
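
To illustrate why fingerprints of the remaining fragments can be derived from the first m, the sketch below shows the additive homomorphism of an evaluation-style fingerprint over a 128-bit binary field. It is a simplified illustration, not the table-driven implementation described above: the byte-wise Horner evaluation, the reduction polynomial x^128 + x^7 + x^2 + x + 1, and all function names are assumptions, and only the XOR (addition) part of the homomorphism is demonstrated.

```python
# Reduction polynomial x^128 + x^7 + x^2 + x + 1 (an assumed choice for this sketch).
MODULUS = (1 << 128) | 0x87


def gf128_mul(a: int, b: int) -> int:
    """Carry-less multiplication of a and b, reduced modulo MODULUS."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= MODULUS
    return product


def fingerprint(data: bytes, r: int) -> int:
    """Evaluate the polynomial whose coefficients are the bytes of data at the point r."""
    fp = 0
    for byte in data:
        fp = gf128_mul(fp, r) ^ byte
    return fp


def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))
```

For equal-length inputs a and b and the same random point r, fingerprint(xor_bytes(a, b), r) equals fingerprint(a, r) ^ fingerprint(b, r), because multiplication by a fixed r distributes over XOR.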

The  client  library  uses  Rabin’s  Information  Dispersal  Algorithm  [84]  for  erasure  coding.  The 
first  m  fragments  consist  of  the  block  divided  into  m  equal  fragments  (that  is,  a  systematic  encoding 
is used). Since m > f and only m + f fragments must be encoded, this cuts the amount of encoding
by  more  than  half. 
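
The systematic part of the encoding needs no arithmetic at all, which the following sketch makes concrete. The function names are hypothetical, the block length is assumed to be a multiple of m, and the computation of the f redundant fragments (the part that does require Rabin's algorithm) is deliberately left out.

```python
from typing import List


def systematic_fragments(block: bytes, m: int) -> List[bytes]:
    """Return the first m fragments: the block cut into m equal pieces."""
    if len(block) % m != 0:
        raise ValueError("block length must be a multiple of m in this sketch")
    size = len(block) // m
    return [block[i * size:(i + 1) * size] for i in range(m)]


def decode_from_first_m(fragments: List[bytes]) -> bytes:
    """Reading the first m fragments requires only concatenation."""
    return b"".join(fragments)


def computed_fraction(m: int, f: int) -> float:
    """Fraction of the m + f stored fragments that must actually be computed."""
    return f / (m + f)
```

With m = 3 and f = 1, only one of the four stored fragments is computed; whenever m > f the fraction stays below one half.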

The SHA-1 and HMAC implementations from the Nettle toolkit [76] are used, which can hash
about 280 megabytes per second on a 3 GHz Pentium D processor. Each hash value is 20 bytes,
and  each  MAC  is  8  bytes.  Due  to  ongoing  advances  in  the  cryptanalysis  of  SHA-1  [33],  a  storage 
system  with  a  long  expected  lifetime  may  benefit  from  a  stronger  hash  function.  The  performance 
of  SHA-512  on  modern  64-bit  processors  has  been  reported  as  comparable  to  that  of  SHA-1  [42]. 
Hence,  though  not  measured,  the  prototype  should  achieve  similar  performance  if  the  hash  were 
upgraded  on  such  systems. 

The prototype implements a few simple optimizations for the protocol. For example, during
commit, only the appropriate pairwise MAC is sent to each server. The client tries writing at the
first m + f servers, considering other servers only if these servers are faulty or unresponsive.
Similarly, the client requests timestamps from the first 2f + 1 servers and reads fragments from the first m
servers, which allows most reads to return in a single round of communication without requiring any
decoding beyond concatenating fragments. Furthermore, if f + 1 or more servers return matching
timestamps, the client does not request nonces because it can conclude that a correct server
committed this write and hence some client invoked a write at this timestamp, satisfying Lemma 4.2.5.

Also,  the  client  limits  the  amount  of  state  required  for  a  read  operation  by  considering  only  the 
most recent timestamp proposed by each server. It does not consider more timestamps until all but
f servers have returned a response for all timestamps currently under consideration. Servers delay
garbage  collection  by  a  few  seconds,  thus  obviating  the  need  for  fast  clients  to  ever  follow  garbage 
collection  redirection. 


4.4  Evaluation 

This section evaluates the protocol proposed in this chapter (the FP protocol) on a distributed storage
prototype. Three competing protocols that are described in Section 4.4.1 were also implemented
and evaluated. The experimental setup is described in Section 4.4.2. Single client write throughput,
read throughput, and response time are evaluated in Sections 4.4.3, 4.4.4, and 4.4.5, respectively.
An analysis of the protocol presented in Section 4.2 suggests that the protocol should perform
similarly to a benign erasure-coded protocol with the additional computational expense of hashing
and the extra bandwidth required for the fpcc, MACs, and nonces. The experimental results confirm
that the FP protocol is competitive with the benign erasure-coded protocol and that it significantly
outperforms competing approaches.

4.4.1  Competing  Protocols 

To  enable  fair  comparison,  the  distributed  storage  prototype  supports  multiple  protocols.  Protocols 
are compared within the same framework to ensure that measurements reflect protocol variations
rather than implementation artifacts. The following three protocols are evaluated in addition to the
FP protocol.

Benign  erasure-coded  protocol:  The  prototype  implements  an  erasure-coded  storage  protocol 
that  tolerates  crashes  but  not  Byzantine  faulty  clients  or  servers.  This  protocol  uses  the  same  erasure 
coding implementation used for the FP protocol. A write operation encodes a block into m + f
fragments  and  sends  these  fragments  to  servers.  A  read  operation  reads  from  the  first  m  servers, 
avoiding  the  need  to  decode.  Both  complete  in  a  single  round  of  communication.  This  protocol 
assumes  concurrency  is  handled  by  some  external  locking  protocol;  a  real  implementation  would 
require  more  rounds  of  communication  for  some  writes,  but  this  overhead  is  ignored.  Hence,  this 
protocol  provides  an  upper  bound  for  the  performance  of  any  erasure-coded  fault-tolerant  storage 
protocol  (Byzantine  or  not). 

Benign replication-based protocol: The prototype implements a replication-based storage
protocol that tolerates crashes but not Byzantine faulty clients or servers. A write operation sends
the block to f + 1 servers, and a read operation reads from a single server. As in the benign
erasure-coded protocol, this protocol assumes concurrency is handled by an external locking
protocol. Hence, though this protocol does not tolerate Byzantine faults, it represents an upper bound
for the performance of any replication-based fault-tolerant storage protocol (Byzantine or not).
It is worth noting, however, that this implementation is substantially faster than most
replication-based Byzantine fault-tolerant storage protocols found in the literature, which often require all-to-all
broadcasts [18, 23], public-key signatures [19, 66], or writing to 2f + 1 or more replicas [18, 23].

Replication-based  protocols  are  excellent  for  reads,  but  write  performance  is  reduced  due  to 
bandwidth  limitations.  One  method  to  overcome  the  network  bandwidth  limitation  between  the 
client  and  the  switch  for  a  single  client  is  to  use  network-level  multicast.  The  replication-based 
protocol  does  not  use  multicast  for  several  reasons.  Multicast  is  unavailable,  unsuitable,  or  unstable 
in  many  network  environments  [44],  and  retransmissions  due  to  congestion  cannot  take  advantage 
of multicast. Also, multicast does nothing to reduce disk bandwidth and network bandwidth between
the switch and the servers. Though each server could encode its own block to reduce disk
bandwidth  [18],  a  multicast-based  protocol  would  not  scale  when  multiple  clients  are  writing  to  the 
same  storage  servers. 

m + 3f Byzantine fault-tolerant erasure-coded protocol: An alternative to Byzantine
fault-tolerant replication-based storage is Byzantine fault-tolerant erasure-coded storage. The prototype
implements a protocol similar to PASIS [44]. PASIS was engineered to improve server throughput
by offloading work to clients. This protocol uses the same erasure coding implementation and
SHA-1 library used for the FP protocol. The protocol implemented by the prototype, however, only
emulates a PASIS-like protocol. It does not implement the versioning storage required by PASIS,
nor does it run a garbage collection protocol, and, hence, it would not be suitable for storing data in
a Byzantine environment. This implementation does, however, provide a comparison point against
the approach most similar to the FP protocol.

To write a block, the client requests the most recent timestamp from each server. It then encodes
the block into m + 3f fragments and hashes each fragment to create a cross-checksum. (Because
a systematic encoding is used, "encoding" the first f fragments does not require computation.) By
comparison, the FP protocol encodes and hashes m + f fragments; f fewer because there are f fewer
servers and another f fewer due to the partial encoding optimization, described in Section 4.2.1. To
complete the write, the m + 3f protocol sends fragments to the first m + 2f servers, considering
other servers only if some servers are unresponsive. By comparison, the FP protocol and the benign
erasure-coded protocol send f fewer fragments because they need f fewer servers. To read a block,
the client requests fragments from the first m servers along with timestamps from the first 3f + 1
servers. Assuming all timestamps match, the client must then verify the cross-checksum, which is
embedded in the timestamp. This requires repeating the write computation: the client must encode
and hash m + 3f fragments to recompute the cross-checksum.

4.4.2  Experimental  Setup 

All experiments are measured using a single client and a collection of servers. Each machine has
a dual-core 3 GHz Pentium D processor, 2 GB of RAM, and an Intel PRO/1000 Gigabit Ethernet
controller, and machines run Linux kernel version 2.6.18. Measurements are taken in the absence
of concurrency and faults, which is expected to be the normal mode of operation in such a storage
system. The client and the servers are connected to the same HP ProCurve Switch 2848 with QoS
passthrough mode set to one-queue and flow control enabled for each port. Each experiment was run
10 times for 60 seconds, with the average performance reported in the figures. Standard deviations
are all within 2% of the average, and performance matches analytical expectations.

The  working  set  of  data  for  each  experiment  is  chosen  to  fit  within  the  server  caches,  and  the 
client  does  not  cache  data.  The  data  gets  loaded  before  measurements,  ensuring  100%  read  hits,  and 
the  servers  use  write-back  with  synchronizing  to  disk  disabled.  The  systems  are  battery-backed, 
but  the  experimental  reason  for  this  setup  is  to  allow  the  measurement  of  protocol  overhead  rather 
than disk latency. Avoiding disk accesses makes performance dependent on the network and
computational behavior of the protocols. If the working set does not fit in server caches, or if durability
requirements prevent using NVRAM for write-back, the choice of protocol matters less because
system performance will be limited by disk performance. In such a scenario, there is an even stronger
argument  for  using  a  Byzantine  fault-tolerant  protocol  rather  than  a  protocol  that  tolerates  only  crash 
faults. 

The client benchmark program is run on a single machine. It generates a synthetic workload.
For throughput measurements, the client spawns several parallel threads, each of which issues a read
or write request for a randomly selected 64 kB block, waits for the response, and then issues another
request. For response time measurements, a single thread issues a single write request, waits for a
response, and then repeats. For erasure-coded protocols (all but replication), m = f + 1. Because
the block size is fixed, the fragment size for erasure-coded protocols decreases as m increases (i.e.,
fragment size is 64/m kB).

4.4.3  Write  Throughput 

Figure 4.4.4 shows the write throughput achieved by a single client executing each of the four
protocols as a function of the number of faults tolerated. The FP protocol significantly outperforms the
Byzantine fault-tolerant m + 3f erasure-coded protocol as well as the crash fault-tolerant
replication-based protocol, and it nearly matches the performance of the benign erasure-coded protocol. For
example, at f = 4, the FP protocol achieves a factor of 2.6 higher throughput than replication, a
factor of 1.4 higher throughput than the m + 3f protocol, and is within 5% of the performance of
the erasure-coded protocol that does not tolerate Byzantine faulty servers or clients.

Figure 4.4.4: Write throughput for each protocol as a function of faults tolerated. The lines report the
performance of each protocol when writing 64 kB blocks. The four circles report the performance
of the FP protocol when writing 16 kB fragments (rather than 64/m kB). [Plot not reproduced;
x-axis: number of faults tolerated (f).]

Each protocol requires a different number of servers to tolerate the same number of faults. The
benign erasure-coded protocol and the FP protocol require m + f responsive servers, the
replication-based protocol requires f + 1 servers, and the m + 3f protocol requires m + 2f responsive servers.
(Both Byzantine fault-tolerant protocols must be able to reach an additional f servers if some of
these servers are not responsive.) For example, for f = 4 and m = 5, the benign erasure-coded
protocol and the FP protocol write data to 9 servers, the replication-based protocol writes to 5 servers,
and the m + 3f protocol writes to 13 servers.
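
The server counts above follow directly from f and m. As a minimal sketch, the fault-free write fan-out of each protocol can be tabulated as below. The function names and dictionary keys are illustrative assumptions and do not come from the prototype.

```python
def servers_written(f, m=None):
    """Servers a client writes to in the fault-free case, per protocol.

    Names and keys are hypothetical; the formulas are the ones stated in the text.
    """
    if m is None:
        m = f + 1  # experimental setting for the erasure-coded protocols
    if f < 1 or m < 1:
        raise ValueError("f and m must be positive")
    return {
        "benign_erasure_coded": m + f,
        "fp": m + f,
        "replication": f + 1,
        "m_plus_3f": m + 2 * f,
    }


def fallback_servers(f):
    """Additional servers the two Byzantine fault-tolerant protocols must be able to reach."""
    return f


if __name__ == "__main__":
    for name, count in servers_written(4, 5).items():
        print(f"{name}: {count}")
```

For f = 4 and m = 5 this gives the 9, 5, and 13 servers quoted in the example.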

Throughput is the amount of useful data written, which is less than the amount of data sent over
the network. The FP protocol and the benign erasure-coded protocol both send $\frac{|B|}{m}(m + f)$ bytes
when writing a block of $|B|$ bytes. Replication must send $|B|(f + 1)$ bytes, and the m + 3f protocol
must send $\frac{|B|}{m}(m + 2f)$ bytes. The erasure-coded protocols could increase throughput for constant f
by increasing m beyond f + 1.
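
To make the byte counts concrete, the following sketch evaluates them for a given block size. It is an illustration under stated assumptions: `bytes_sent` is a hypothetical helper, and it counts only fragment or replica payload, ignoring fpcc, MAC, nonce, and network header overhead.

```python
from fractions import Fraction


def bytes_sent(block_bytes, f, m=None):
    """Payload bytes sent per block write, per protocol (hypothetical helper).

    Uses exact fractions so that ratios between protocols are not blurred by rounding.
    """
    if m is None:
        m = f + 1  # experimental setting
    fragment = Fraction(block_bytes, m)  # |B| / m bytes per fragment
    return {
        "fp": fragment * (m + f),
        "benign_erasure_coded": fragment * (m + f),
        "replication": Fraction(block_bytes) * (f + 1),
        "m_plus_3f": fragment * (m + 2 * f),
    }


if __name__ == "__main__":
    sent = bytes_sent(64 * 1024, f=4, m=5)
    for name, value in sent.items():
        print(f"{name}: {float(value):.0f} bytes, {float(value / sent['fp']):.2f}x FP")
```

At f = 4 and m = 5, replication sends 25/9 (about 2.8) times and the m + 3f protocol 13/9 (about 1.4) times as many payload bytes as the FP protocol, which is broadly in line with the throughput factors of 2.6 and 1.4 reported for f = 4.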

The benign erasure-coded protocol performs well, as expected, achieving a write throughput
close to m/(m + f) of the total network bandwidth available. The FP protocol performs almost as well.
When  tolerating  up  to  6  Byzantine  faulty  servers,  it  performs  within  10%  of  the  benign  protocol  that 
only  tolerates  server  crashes.  As  the  number  of  servers  in  the  system  grows,  however,  the  additional 
network  overhead  in  the  FP  protocol  becomes  noticeable  for  two  reasons.  First,  because  block  size  is 
constant,  the  size  of  the  fragment  written  at  each  server  decreases  as  the  number  of  servers  increases. 
Second, the sizes of the fpcc, MACs, and nonces increase as the number of servers increases. For
f = 6, fragment size is over 9 kB and fpcc, MAC, and nonce overhead is under 700 bytes (overhead
is under 7% of data sent). For f = 10, fragment size is under 6 kB and fpcc, MAC, and nonce
overhead is over 1100 bytes (overhead is over 15% of data sent).

Figure 4.4.5: Read throughput as a function of faults tolerated. [Plot not reproduced.]


One  solution  to  this  problem  is  to  increase  the  block  size.  For  example,  the  four  circles  in 
Figure 4.4.4 show the throughput of the FP protocol when fragment size is 16 kB. The FP protocol
performs  within  10%  of  the  benign  protocol  when  the  fragment  size  is  increased  to  16  kB  for  both 
protocols,  even  when  tolerating  10  faults.  (The  benign  protocol  performs  less  than  3%  better  when 
the  fragment  size  is  increased  to  16  kB.) 
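
The effect of fragment size on relative overhead can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes that 1 kB means 1024 bytes, that "data sent" means fragment plus overhead for one server, and it plugs in the 700-byte and 1100-byte bounds quoted in the text rather than exact overhead sizes. The helper names are hypothetical.

```python
KB = 1024  # assumption: the text's "kB" is read as 1024 bytes


def fragment_bytes(f, block_kb=64):
    """Fragment size in bytes when m = f + 1, as in the experiments."""
    m = f + 1
    return block_kb * KB / m


def overhead_fraction(fragment, overhead):
    """Share of the bytes sent to one server that is fpcc, MAC, and nonce overhead."""
    return overhead / (fragment + overhead)


if __name__ == "__main__":
    for f, overhead in ((6, 700), (10, 1100)):
        frag = fragment_bytes(f)
        share = overhead_fraction(frag, overhead)
        print(f"f={f}: fragment={frag:.0f} B, overhead share={share:.1%}")
    # Same 1100-byte bound, but with the fragment size raised to 16 kB.
    print(f"16 kB fragment: overhead share={overhead_fraction(16 * KB, 1100):.1%}")
```

With 16 kB fragments, the 1100-byte bound corresponds to an overhead share of roughly 6%, comparable to the f = 6 case with 64 kB blocks.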

The replication-based protocol performs poorly for all but the smallest number of faults
tolerated, as expected, because it writes $(f + 1) / \frac{m + f}{m} > (f + 1)/2$ times as much data as the
benign erasure-coded protocol. The m + 3f Byzantine fault-tolerant erasure-coded protocol writes
$\frac{m + 2f}{m + f} \approx 1.5$ times as much data as the benign erasure-coded protocol, and up until about f = 4 it
is only a factor of 1.5 times worse. The m + 3f protocol, however, must encode and hash m + 3f
fragments to generate the cross-checksum even though it only writes to m + 2f servers because
it does not include the partial encoding optimization. Hence, for f > 4, the m + 3f protocol is
computationally bound by the client.
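
The two data-volume ratios in this paragraph can be derived from the byte counts given earlier in this section. The following is a worked sketch under the experimental setting m = f + 1. It is added for illustration and is not part of the original analysis.

```latex
\begin{align*}
\frac{\text{replication}}{\text{benign erasure-coded}}
  &= \frac{|B|(f+1)}{\frac{|B|}{m}(m+f)}
   = \frac{m(f+1)}{m+f}
   = \frac{(f+1)^2}{2f+1}
   > \frac{f+1}{2}
   && \text{since } 2(f+1) > 2f+1, \\[4pt]
\frac{m+3f \text{ protocol}}{\text{benign erasure-coded}}
  &= \frac{\frac{|B|}{m}(m+2f)}{\frac{|B|}{m}(m+f)}
   = \frac{m+2f}{m+f}
   = \frac{3f+1}{2f+1}
   = \frac{3}{2} - \frac{1}{2(2f+1)} .
\end{align*}
```

At f = 4 the second ratio is 13/9, about 1.44, and it approaches 1.5 from below as f grows.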


4.4.4  Read  Throughput 

Figure 4.4.5 shows the read throughput achieved by a single client executing each of the four
protocols as a function of the number of faults tolerated. The FP protocol achieves read throughput within
10% of the two benign protocols and significantly outperforms the m + 3f protocol. The slight drop
for  the  FP  protocol  and  the  benign  erasure-coded  protocol  as  the  number  of  faults  tolerated  increases 
is  due  to  network  congestion  caused  by  the  increasing  number  of  servers  providing  responses. 

All  four  protocols  read  the  same  amount  of  data.  The  replication-based  protocol  reads  an  entire 
64  kB  block  from  a  single  server,  and  the  other  protocols  read  fragments  from  m  servers.  The 
erasure-coded  protocols  read  from  the  first  m  servers  to  avoid  the  need  to  decode.  In  addition  to 
fragments,  the  Byzantine  fault-tolerant  protocols  must  read  timestamps  from  more  servers  to  check 
for concurrency. The FP protocol reads timestamps from 2f + 1 - m = f more servers, while the
m + 3f protocol reads timestamps from 3f + 1 - m = 2f more servers. Assuming all timestamps
match, a read completes in a single round of communication.

Figure 4.4.6: Write response time as a function of faults tolerated. [Plot not reproduced; x-axis: number of faults tolerated (f).]

Figure 4.4.7: Write response time breakdown for f = 10. [Table not reproduced.]

Once fragments are read, the Byzantine fault-tolerant protocols must verify data. The FP
protocol requires just a hash and a fingerprint of the fragments. The m + 3f protocol, however, must
recompute the cross-checksum, which requires encoding and hashing m + 3f fragments and is quite
expensive for large values of f.
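
The read-path costs described in this section can be summarized per protocol. The sketch below is a reading of the text rather than code from the prototype. `read_costs` and its keys are hypothetical names, and the zero re-encoding count for the FP protocol reflects the statement that it needs only a hash and a fingerprint of the fragments it read.

```python
def read_costs(f, m=None):
    """Fault-free read costs for the two Byzantine fault-tolerant protocols.

    Hypothetical helper; the formulas are taken from the surrounding text.
    """
    if m is None:
        m = f + 1  # experimental setting
    return {
        "fp": {
            "fragment_servers": m,
            "timestamp_servers": 2 * f + 1,
            "extra_timestamp_servers": 2 * f + 1 - m,
            "fragments_reencoded": 0,  # hash and fingerprint only
        },
        "m_plus_3f": {
            "fragment_servers": m,
            "timestamp_servers": 3 * f + 1,
            "extra_timestamp_servers": 3 * f + 1 - m,
            "fragments_reencoded": m + 3 * f,  # recompute the cross-checksum
        },
    }


if __name__ == "__main__":
    for protocol, costs in read_costs(10).items():
        print(protocol, costs)
```

Note that the extra timestamp servers equal f and 2f only when m = f + 1. For f = 10 the m + 3f protocol re-encodes and hashes 41 fragments on every read.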

4.4.5  Response  Time 

Figure 4.4.6 shows the response time of a single write for each of the four protocols as a function
of the number of faults tolerated. The FP protocol requires on average 1.04 ms more to complete a
write operation than the benign erasure-coded protocol, which is on average 1.65 times worse. This
is, however, a substantial improvement over the m + 3f protocol and the replication-based protocol,
which both scale worse than the FP protocol.

Figure 4.4.7 provides a breakdown of the average latency of each operational component of
a write for f = 10 as seen by the client. The table lists the time each protocol spent encoding,
hashing, and fingerprinting fragments (other computational contributions were negligible); it also
lists the time spent waiting for the network, which includes time spent in the kernel. As seen in the
table, about half of the additional latency for a write by the FP protocol as compared to the benign
erasure-coded protocol is due to hashing and fingerprinting, and the other half is due to the extra
round of communication. The additional latency for the replication-based protocol is, of course, due
to the extra bandwidth required to write f replicas. The additional latency for the m + 3f protocol
is due to the encoding of 2f more fragments, the extra round of communication, the sending of
1.5 times as many fragments, and the hashing of m + 3f fragments.

Figure 4.4.8: Read response time as a function of faults tolerated. [Plot not reproduced.]

Figure 4.4.9: Read response time breakdown for f = 10. [Table not reproduced.]

Read response time (Figure 4.4.8) is as expected. Figure 4.4.9 provides a breakdown of the
average latency of each operational component of a read for f = 10 as seen by the client. The
benign  erasure-coded  protocol  and  the  replication  protocol  require  on  average  0.84  ms  and  0.80  ms 
respectively  to  read  a  single  block  when  tolerating  between  one  and  ten  faults.  The  FP  protocol 
requires  on  average  1.29  ms,  the  difference  being  the  time  needed  to  hash  and  fingerprint  fragments. 
Each  of  these  protocols  requires  about  the  same  amount  of  time  to  read  a  block  when  tolerating 
one  fault  as  when  tolerating  ten  faults.  The  m  +  3f  protocol  requires  3.05  ms  on  average,  and  it 
requires  1.58  ms  to  read  a  single  block  when  tolerating  one  fault  but  4.54  ms  when  tolerating  ten 
faults.  It  scales  worse  than  the  other  protocols  because  it  must  encode  and  hash  m  +  3f  fragments 
to  recompute  the  cross-checksum. 


4.5  Conclusion 

Distributed  block  storage  systems  can  tolerate  Byzantine  faults  in  asynchronous  environments  with 
little overhead over systems that tolerate only crashes. Replication-based block storage protocols
are effective for workloads that are mostly reads or when tolerating a single fault, but exhibit low
throughput and high latency for large writes. Erasure-coded protocols provide higher-throughput
writes and can increase m for fixed f to realize even higher throughput. Previous Byzantine
fault-tolerant erasure-coded protocols, however, exhibit low client throughput for reads and high
computational overheads for both reads and writes. This chapter presents the FP protocol, a Byzantine
fault-tolerant erasure-coded protocol that performs well for both reads and large writes.
Measurements of a prototype implementation demonstrate that this protocol exhibits throughput within 10%
of the ideal crash fault-tolerant erasure-coded protocol for reads and sufficiently large writes.
Furthermore, the FP protocol has little computational overhead other than a cryptographic hash and a
homomorphic  fingerprint  of  the  data. 


Chapter  5 


Scalable  Fault  Tolerance  through 
Byzantine  Locking 


As  distributed  systems  grow  in  size  and  importance,  they  must  tolerate  complex  software  bugs  and 
hardware  misbehaviors  in  addition  to  simple  crashes  and  lost  messages.  Byzantine  fault-tolerant 
protocols  can  tolerate  arbitrary  problems,  making  them  an  attractive  building  block — in  theory.  But, 
in  practice,  system  designers  continue  to  worry  that  their  performance  overheads  and  scalability 
limitations  are  too  great.  Recent  research  has  improved  performance  by  exploiting  optimism  to 
improve  common  cases,  but  a  significant  gap  still  exists. 

The  Zzyzx  replicated  state  machine  protocol  bridges  that  gap  with  a  new  technique  called 
Byzantine  Locking.  Layered  atop  a  Byzantine  fault-tolerant  replicated  state  machine  protocol  (e.g., 
PBFT [24] or Zyzzyva [55]), Byzantine Locking can be used to temporarily give a client
exclusive access to state in the replicated state machine. It uses the underlying Byzantine fault-tolerant
replicated  state  machine  to  extract  the  relevant  state  and,  later,  to  re-integrate  it.  Unlike  locking 
in  non-Byzantine  fault-tolerant  systems,  Byzantine  Locking  is  only  a  performance  tool.  To  ensure 
liveness,  locked  state  is  kept  on  servers,  and  a  client  that  tries  to  access  objects  locked  by  another 
client  can  request  that  the  locks  be  revoked,  forcing  both  clients  back  to  the  underlying  replicated 
state  machine  to  ensure  consistency. 

Byzantine  Locking  provides  unprecedented  scalability  and  efficiency  for  the  common  case  of 
infrequent  concurrent  data  sharing.  Most  notably,  the  server  processes  to  which  locked  state  is 
extracted  for  servicing  operations  by  the  locking  client — in  the  parlance  of  Byzantine  Locking,  log 
servers — can  execute  on  distinct  physical  computers  from  the  replicas  for  the  underlying  replicated 
state  machine.  Thus,  multiple  log  server  groups,  each  running  on  distinct  physical  computers,  can 
be used for independently locked state, allowing throughput to be scaled by adding computers, as
shown  in  Figure  5.0.1.  Even  when  running  the  log  servers  on  the  same  computers  as  the  underlying 
replicated  state  machine,  exclusive  access  allows  clients  to  execute  a  sequence  of  operations  much 
more efficiently (just one round-trip with only 2f + 1 responses, where f is the number of faulty
servers tolerated), because concurrency is explicitly precluded. This performance benefit can be
seen in Figure 5.0.1 by comparing the 4-server throughputs of Zyzzyva and Zzyzx: Byzantine
Locking enables the 2.9× higher throughput shown.

Figure 5.0.1: Throughput vs. servers. Zzyzx’s throughput scales nearly linearly as servers are
added. Zyzzyva does not use additional servers to improve throughput, so the dashed line repeats its
4-server throughput for reference. Even with the minimum number of servers (4), Zzyzx significantly
outperforms Zyzzyva. The measured workload includes no faults or data sharing among clients. All
configurations measured tolerate one Byzantine fault (f = 1). Section 5.5 describes the experimental
setup and results in detail. [Plot not reproduced; x-axis: number of responsive servers.]


Zzyzx implements Byzantine Locking on top of Zyzzyva [55], a state-of-the-art Byzantine
fault-tolerant replicated state machine, to improve performance while providing the same correctness
and liveness guarantees as Zyzzyva. Experiments, described in Section 5.5, show that Zzyzx can
provide 39-43% lower latency and 2.2-2.9× higher throughput when using the same
servers, compared to Zyzzyva, for operations on locked objects. Postmark [52] completed 60%
more transactions on a Zzyzx-based file system than one based on Zyzzyva, and Zzyzx provided
1.6× higher throughput for a trace-based metadata workload. The benefits of locking
outweigh  the  cost  of  unlocking  after  as  few  as  ten  operations.  Operations  on  concurrently  shared 
data  objects  do  not  use  the  Byzantine  Locking  layer — clients  just  execute  the  underlying  Zyzzyva 
protocol  directly.  Thus,  except  when  transitioning  objects  from  unshared  to  shared,  the  common 
case  (unshared)  proceeds  with  maximal  efficiency  and  the  uncommon  case  is  no  worse  off  than  the 
underlying  Byzantine  fault-tolerant  replicated  state  machine. 

Although  it  will  provide  correct  behavior  under  any  workload,  the  benefits  of  Byzantine  Locking 
will be realized most in services whose state consists of many objects that are rarely shared. This
characterization fits many critical services for which both scalability and Byzantine fault tolerance
are desirable. For example, the metadata service of most distributed file systems contains a distinct
object for each file or directory, and concurrent sharing is rare [10]. Similarly, a distributed
key-value store for website personalization [26] or e-commerce shopping carts [34] may have many
concurrent  writers  that  access  distinct  keys. 

This chapter makes three primary contributions. First, it introduces Byzantine Locking as a
means of realizing unprecedented scalability and efficiency for Byzantine fault-tolerant replicated
state machines when concurrent data sharing is uncommon. Second, it describes Zzyzx, a
Byzantine fault-tolerant replicated state machine protocol that layers Byzantine Locking atop Zyzzyva
to  demonstrate  the  performance  and  scalability  features  of  Byzantine  Locking.  Third,  it  uses  the 
Zzyzx  prototype  to  evaluate  the  performance  characteristics  of  Byzantine  Locking  and  shows  that 
Byzantine  Locking  provides  the  expected  scaling  and  significant  performance  improvements  for 
collections  of  usually-independent  objects. 


5.1  Context  and  Related  Work 

Large  distributed  systems  exhibit  more  faults  and  more  types  of  faults  than  traditional  fault-tolerance 
techniques  can  manage.  For  example,  a  single  misdirected  write  can  corrupt  data  in  protocols  such 
as  Paxos.  In  response,  many  practitioners  use  more  robust  protocols.  Modern  systems  include  a 
variety  of  checksums  and  consistency  checks,  but  ad-hoc  robustness  mechanisms  do  not  capture 
some  real  failure  modes  [9,  25,  60].  Efficient  protocols  that  survive  the  fuller  range  of  failure 
modes  are  important  for  the  critical  services  needed  by  cloud  computing,  data  centers,  and  clustered 
storage  systems,  where  the  higher  frequency  of  faults,  wider  diversity  of  corruptions  and  faults  seen 
in  practice,  and  asynchronous  network  behavior  break  traditional  techniques. 

As  fault  tolerance  has  grown  more  important  and  techniques  to  achieve  fault  tolerance  less 
expensive,  there  has  been  a  natural  progression  of  distributed  protocols  used  by  practitioners,  from 
unreplicated  to  replicated,  synchronous  to  asynchronous,  and  crash-tolerant  to  ad-hoc  consistency 
checks.  The  next  step  is  from  ad-hoc  consistency  to  full  Byzantine  fault  tolerance. 

Byzantine fault-tolerant protocols tolerate any number of faulty or malicious clients and a
fraction of faulty or malicious servers. Byzantine fault tolerance ensures that all bases are covered,
protecting against misdirected writes, soft errors, and other faults and corruptions found in modern
hardware and software. A Byzantine fault-tolerant replicated state machine protocol can be used
to implement any deterministic service. Given the growth in size and importance of many
distributed services, one would like to use Byzantine fault-tolerant replicated state machines to make
such services more robust. Toward that end, much recent research has focused on designing
Byzantine fault-tolerant replicated state machine protocols with improved performance and scalability,
especially  during  fault-free  periods  of  operation. 


5.1.1  The  Byzantine  Efficiency  Race 

Recent years have seen something of an arms race among researchers seeking to provide
application writers with efficient Byzantine fault-tolerant substrates. Perhaps unintentionally, Castro and
Liskov [24] initiated this race in proposing a new protocol and labeling it “practical,” because it
demonstrated performance considerably better than most expected could be achieved with
Byzantine fault-tolerant systems. Their protocol replaces the digital signatures common in previous
protocols with message authentication codes (MACs) and also increases efficiency with request batching,
link-level broadcast, and agreement-free optimistic reads [21]. Still, the protocol requires four
message delays and all-to-all communication for all mutating operations, leaving room for
improvement. Castro and Liskov called their protocol “BFT”. To avoid confusion with the acronym for
Byzantine fault tolerance, this paper follows the convention of calling their protocol PBFT.

Figure 5.1.2: Comparison of Byzantine fault-tolerant replicated state machine protocols in
the absence of faults and contention, along with commonly accepted lower bounds. Data for
PBFT, Q/U, HQ, and Zyzzyva are taken from [55]. Bold entries identify best-known values. f
denotes the number of server faults tolerated, and B denotes the request batch size (see Section 5.5).
“Responsive servers needed” refers to the number of servers that must respond in order to achieve
good performance. *The throughput scalability provided by quorum protocols is limited by the
requirement for overlap between valid quorums [69, 71]. [Table not reproduced.]


Abd-El-Malek et al. [2] proposed Q/U, a quorum-based Byzantine fault-tolerant protocol that
exploits  speculation  and  quorum  constructions  to  provide  throughput  that  can  increase  somewhat 
with  addition  of  servers.  Q/U  provides  Byzantine  fault-tolerant  operations  (including  multi-object 
operations)  on  a  collection  of  objects  comprised  of  state  and  associated  operations.  Operations  are 
optimistically  executed  in  just  one  round-trip,  and  object  histories  are  used  to  resolve  issues  created 
by  concurrency  or  failures.  Fortunately,  both  are  expected  to  be  rare  in  many  important  usages.  One 
example  of  this  is  the  file  servers  that  have  been  used  as  concrete  examples  in  papers  on  this  topic 
(e.g.,  [24]).  Matching  conventional  wisdom,  analysis  of  NFS  traces  from  a  departmental  server  [38] 
confirms that most files are used by a single client and that, when a file is shared, there is almost
always  only  one  client  using  it  at  a  time.  For  example,  in  the  traces  examined,  fewer  than  one  in 
one  thousand  operations  would  experience  any  contention  in  a  replicated  state  machine  protocol.  If 
one  discounts  the  two  most  contentious  file  handles,  less  than  one  in  ten  thousand  operations  would 
experience  contention  on  average. 

Cowling  et  al.  [32]  proposed  HQ,  which  uses  a  hybrid  approach  to  achieve  the  benefits  of  Q/U 
without the increased minimum number of servers (3f + 1 for HQ vs. 5f + 1 for Q/U). Optimistically,
an efficient quorum protocol executes operations unless concurrency or failures are detected. Each
operation  that  encounters  such  issues  then  executes  a  second  protocol  to  achieve  correctness.  In 
reducing  the  number  of  servers,  HQ  increases  the  common  case  number  of  message  delays  for 
mutating  operations  to  four  (two  roundtrips). 

Most recently, Kotla et al. [55] proposed Zyzzyva, which avoids all-to-all communication
without additional servers, performs better than HQ under contention, and requires only three message
delays. Unlike other protocols, however, Zyzzyva requires that all 3f + 1 nodes are responsive in
order to achieve good performance; timeouts trigger a second protocol phase. Unfortunately, whether
by misfortune or by design, some servers in many distributed systems will sometimes respond more
slowly than others. Furthermore, requiring that all 3f + 1 servers respond to avoid extra work
precludes techniques that reduce the number of servers needed in practice. For example, if only 2f + 1
servers need be responsive, the f “non-responsive” servers can be shared by neighboring Byzantine
fault-tolerant clusters or used to fill other needs without being burdened under normal operation.
Non-responsive  servers  could  even  be  turned  off  to  save  power,  at  the  cost  of  higher  latency  when 
faults  occur. 

In  a  recent  study  of  several  Byzantine  fault-tolerant  replicated  state  machine  protocols,  Singh 
et al. concluded that “one-size-fits-all protocols may be hard if not impossible to design in
practice” [96]. They note that “different performance trade-offs lead to different design choices within
given network conditions.” Indeed, there are several parameters to consider, including the total
number of replicas, the number of replicas that must be responsive for good performance, the
number of message delays in the common case, the performance under contention, and the throughput,
which  is  roughly  a  function  of  the  numbers  of  cryptographic  operations  and  messages  per  request. 

Unfortunately, none of the above protocols score well on all of these metrics, as shown in
Figure 5.1.2. PBFT requires four message delays and all-to-all communication, Q/U requires additional
replicas, HQ requires four message delays and performs poorly under contention, and Zyzzyva
performs poorly unless all nodes are responsive. (Zyzzyva5, a variant of Zyzzyva, performs well even
when some nodes are not responsive, but requires additional replicas [55].)


5.1.2  How  Zzyzx  Fits  In 

Like  the  systems  above,  Zzyzx  is  optimized  to  perform  well  in  environments  where  faults  are  rare 
and  concurrency  is  uncommon,  while  providing  correct  operation  under  harsher  conditions.  During 
benign  periods,  Zzyzx  outperforms  and  scales  better  than  all  of  the  prior  approaches,  requiring 
the minimum possible numbers of message delays (two, which equals one round-trip), responsive
servers (2f + 1), and total servers (3f + 1). Zzyzx provides unprecedented scalability, because it does
not require overlapping quorums as in prior protocols (HQ and Q/U) that provide any scaling;
non-overlapping server sets can be used for frequently unshared state. When concurrency is common,
Zzyzx  performs  similarly  to  its  underlying  protocol  (e.g.,  Zyzzyva). 
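
The fault-free parameters quoted in the prose of Sections 5.1.1 and 5.1.2 can be collected in one place for a given f. The sketch below is illustrative. The `Protocol` class and `summarize` function are hypothetical names, PBFT's 3f + 1 total is the standard figure for that protocol rather than a number stated in this section, and responsive-server counts are left as `None` where the prose does not give them (the full values are in Figure 5.1.2).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Protocol:
    name: str
    total_servers: Callable[[int], int]
    message_delays: int  # common case, mutating operations
    # None means the count is not stated in the prose of this section.
    responsive_servers: Optional[Callable[[int], int]] = None


PROTOCOLS = [
    Protocol("PBFT", lambda f: 3 * f + 1, 4),
    Protocol("Q/U", lambda f: 5 * f + 1, 2),  # one round-trip
    Protocol("HQ", lambda f: 3 * f + 1, 4),
    Protocol("Zyzzyva", lambda f: 3 * f + 1, 3, lambda f: 3 * f + 1),
    Protocol("Zzyzx", lambda f: 3 * f + 1, 2, lambda f: 2 * f + 1),
]


def summarize(f):
    """Return (name, total servers, responsive servers or None, message delays) rows."""
    rows = []
    for p in PROTOCOLS:
        responsive = p.responsive_servers(f) if p.responsive_servers else None
        rows.append((p.name, p.total_servers(f), responsive, p.message_delays))
    return rows


if __name__ == "__main__":
    for row in summarize(1):
        print(row)
```

For f = 1, Zzyzx needs 4 servers in total, 3 of them responsive, and 2 message delays, whereas Zyzzyva needs all 4 servers responsive and 3 message delays.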

Zzyzx  takes  inspiration  from  the  locking  mechanisms  used  by  many  distributed  systems  to 
achieve  high  performance  in  benign  environments.  For  example,  GPFS  uses  distributed  locking 
to provide clients byte-range locks that enable its massive parallelism [92]. In benign fault-tolerant
environments,  where  lockholders  may  crash  or  be  unresponsive,  other  clients  or  servers  must  be 
able  to  break  the  lock.  To  tolerate  Byzantine  faults,  the  protocol  must  additionally  ensure  that  lock 
semantics  are  not  violated  by  faulty  servers  or  clients  and  that  a  broken  lock  is  always  detected  by 
correct  clients.  Section  5.3  details  how  this  is  accomplished  for  Byzantine  Locking. 

By  allowing  clients  to  acquire  locks,  and  then  only  allowing  clients  that  have  the  lock  on  given 
state  to  execute  operations  on  it,  Zzyzx  achieves  much  higher  efficiency  for  sequences  of  operations 
from  that  client.  Each  replica  can  proceed  on  strictly  local  state,  given  evidence  of  lock  ownership, 
thus avoiding all inter-replica communication. Also, locked state can be transferred to other
servers,  allowing  non-overlapping  sets  of  servers  to  handle  independently  locked  state. 

Zzyzx handles concurrency in a manner similar to HQ, though in reverse. In particular, clients are
not required to use Byzantine Locking, but they can do so when concurrency is not expected.
So, whereas HQ falls back on a protocol such as PBFT or Zyzzyva that handles concurrency well
each time concurrency is discovered, Zzyzx can use PBFT or Zyzzyva natively until concurrency is
determined to be uncommon, at which point clients begin using Byzantine Locking. Thus, Zzyzx
can avoid performance losses under concurrency, if good policy decisions are made, while gaining
performance and scalability for rarely-shared state.

5.1.3  Prior  Byzantine  Fault-Tolerant  Replicated  State  Machine  Protocols 

Recent Byzantine fault-tolerant replicated state machine protocols, such as PBFT, Q/U, HQ,
Zyzzyva, and Zzyzx, build upon several years of prior distributed systems research [17, 36, 54, 63, 69,
86, 87]. For example, Reiter [86] proposed the Rampart toolkit, which implements an asynchronous
Byzantine fault-tolerant replicated state machine. Rampart uses an atomic multicast protocol and
a secure membership protocol. The atomic multicast protocol is similar to PBFT, except the
primary in PBFT is called the sequencer in Rampart. Public-key cryptographic overhead limits the
performance of Rampart.

Rampart’s  membership  protocol  highlights  the  importance  of  the  system  model.  In  Reiter’s 
model,  there  are  an  unbounded  number  of  servers,  any  number  can  be  faulty,  but  only  a  subset  are 
group  members,  and  correctness  depends  upon  fewer  than  a  third  of  the  group  members  being  faulty. 
Castro  and  Liskov  fix  group  membership  [24],  and  correctness  depends  upon  fewer  than  a  third  of 
servers total being faulty. Castro and Liskov note that if too many correct servers are incorrectly
removed from the group in Rampart, the remaining faulty servers may violate correctness [24, Section
8]. One benefit of a group membership model is that such protocols can perform well and ensure
correctness in an environment with many faulty and non-responsive servers so long as the group
invariant is maintained. To emulate the fixed group membership in Castro and Liskov’s model,
Rampart can simply lower the ranking of group members that would be removed (i.e., remove and
rejoin suspected members).

Several  protocols  use  unreliable  failure  detectors  [36,  53,  54,  67],  which  allow  for  a  modular 
protocol design. For example, the SecureRing protocol uses an unreliable failure detector to
provide an asynchronous Byzantine fault-tolerant group communication abstraction [54], which can
be used to build a replicated state machine. As in Rampart, SecureRing uses a secure group
membership protocol. Rampart [87] and SecureRing [54] also batch operations to minimize cryptographic
and network overhead (SecureRing calls batching packing).

5.1.4  Additional  Related  Work 

There  has  been  similar  progress  on  Byzantine  fault-tolerant  read/write  protocols,  which  can  be  used 
to  build  robust  block  storage  systems.  Recent  protocols  have  included  PASIS  [44]  and  AVID  [18]. 
Most  recently,  Hendricks  et  al.  [48]  proposed  a  protocol  that  demonstrates  <10%  overhead  for  the 
large  data  objects  that  characterize  high-bandwidth  storage  applications. 

Farsite [4, 35] uses a Byzantine fault-tolerant replicated state machine to manage metadata in a
distributed file system. Farsite issues leases to clients for metadata such that clients can update
metadata locally, which can increase scalability. Upon conflict or timeout, leases are recalled, but updates
may be lost if the client is unreachable. Leasing schemes do not provide the strong consistency
guarantees expected of replicated state machines (linearizability [51]), so leasing is not acceptable for
some applications. Also, choosing lease timeouts presents an additional challenge: a short timeout
increases the probability that a client will miss a lease recall or renewal, but a long timeout may
stall other clients needlessly in case of failure. Farsite expires leases after a few hours [35], which
is acceptable only because Farsite is not designed for large-scale write sharing [13].

To scale metadata further, Farsite hosts metadata on multiple independent replicated state
machines called directory groups. To ensure namespace consistency, Farsite uses a special-purpose
subprotocol to support Windows-style renames across directory groups [35]. This subprotocol
allows Farsite to scale, but it is inflexible and does not generalize to other operations. For example,
the subprotocol cannot handle POSIX-style renames [35]. Byzantine Locking would allow Farsite
and similar protocols to maintain scalability without resorting to a special-purpose protocol.

Yin  et  al.  [107]  describe  an  architecture  in  which  agreement  on  operation  order  is  separated 
from  operation  execution,  allowing  execution  to  occur  on  distinct  servers.  But,  the  replicated  state 
machine  protocol  is  never  relieved  of  the  task  of  ordering  operations,  and  as  such,  it  remains  a 
bottleneck  to  performance. 

Recent  research  has  provided  techniques  for  improved  uncommon  case  performance,  which 
would complement Zzyzx’s improvement of common case performance. Clement et al. [30]
improve the performance of Byzantine fault-tolerant systems under attack. Singh et al. [95] propose
using  a  pre-serializer  to  mask  conflicts  in  quorum-based  protocols,  which  can  improve  performance 
in  the  presence  of  contention. 

Dividing  the  state  machine  into  objects,  as  is  required  to  achieve  the  benefits  of  Zzyzx,  has 
been  used  in  many  previous  replicated  state  machine  systems.  (Quorum  systems  frequently  use 
this  technique  as  well,  as  discussed  above.)  For  example,  in  their  work  preceding  Zyzzyva,  Kotla 
et  al.  [57]  describe  CBASE,  which  partitions  a  state  machine  into  objects  and  allows  concurrent 
execution  of  operations  known  to  involve  only  independent  state.  Rodrigues  et  al.  [88]  describe 
BASE,  which  partitions  a  state  machine  into  objects  to  allow  independent  recovery  of  (abstract 
views  of)  state  across  non-identical  replicas. 

Much  progress  has  also  been  made  on  improving  non-Byzantine  fault-tolerant  replicated  state 
machine protocols, such as Paxos [64]. Chandra et al. [25] hardened a production Paxos
implementation with ad-hoc consistency checks to tolerate some corruptions. Mao et al. [72] proposed
Mencius,  a  Paxos-like  protocol  in  which  the  leader  rotates  to  improve  throughput. 


5.2  Definitions  and  System  Model 

This chapter makes the same assumptions about network asynchrony and the security of
cryptographic primitives (e.g., MACs, signatures, and hash functions), and offers the same guarantees of
liveness and correctness (linearizability), as the most closely related prior works [2, 24, 32, 55].
Zzyzx tolerates up to f Byzantine faulty servers and any number of Byzantine faulty clients, given
3f + 1 servers. As will be discussed, Zzyzx allows physical servers to take on different roles in the
protocol, namely as log servers or state machine replicas. A log server and replica can be co-located
on a single physical server, or each can be supported by separate physical servers. Regardless of the
mapping of roles to physical servers, the presentation here assumes that there are 3f + 1 log servers,
at most f of which fail, and 3f + 1 replicas, at most f of which fail.
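
The server counts in this model follow directly from f. The short sketch below tabulates them; the class name `Sizing` and its property names are illustrative assumptions, not part of Zzyzx.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    """Server counts for tolerating f Byzantine faulty servers (illustrative)."""
    f: int

    @property
    def total(self) -> int:
        # 3f + 1 log servers (and, separately, 3f + 1 replicas).
        return 3 * self.f + 1

    @property
    def responsive(self) -> int:
        # 2f + 1 servers must respond for an operation on locked objects.
        return 2 * self.f + 1

    @property
    def may_be_unresponsive(self) -> int:
        # The remaining servers need not respond: (3f + 1) - (2f + 1) = f.
        return self.total - self.responsive


for f in (1, 2, 3):
    s = Sizing(f)
    print(f"f={f}: total={s.total}, responsive={s.responsive}, "
          f"may be unresponsive={s.may_be_unresponsive}")
```

With f = 1, for instance, four log servers are deployed and any three of them suffice to complete an operation on locked objects.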

To  model  the  strong  guarantees  provided  by  Byzantine  fault  tolerance,  faulty  nodes  are  assumed 
to be controlled and coordinated by a malicious adversary. The adversary also controls the
network, deciding if and when messages are delivered and corrupting messages at will. Cryptographic
message authentication codes (MACs) and signatures are used to ensure the integrity of messages
between correct nodes.

Of  course,  such  a  powerful  adversary  could  prevent  progress  by  refusing  to  deliver  messages. 
In  general,  a  replicated  state  machine  may  not  be  live  in  an  asynchronous  network  environment, 
even if only a single benign fault might occur [40]. Zzyzx implements Byzantine Locking on
top of Zyzzyva, and it provides the same liveness guarantee [55]: Zzyzx is live if the network is
eventually  synchronous  [37],  i.e.,  there  is  a  fixed  (though  potentially  unknown)  delay  bound  and 
some  (unknown)  point  in  time  after  which  all  messages  are  delivered  within  that  bound.  In  general, 
Byzantine  Locking  inherits  the  liveness  properties  of  the  underlying  protocol.  For  example,  an 
obstruction-free  Byzantine  Locking  protocol  could  be  built  on  top  of  Q/U  [2]. 

As in prior protocols [2, 24, 32, 55], Zzyzx satisfies linearizability [51] from the perspective of
correct clients. Linearizability requires that correct clients issue operations sequentially, leaving at
most one operation outstanding at a time, as assumed in this chapter, but this requirement can be
relaxed. Each operation applies to one or more objects, which are individual components of state
within the state machine.

Two operations are concurrent if each operation’s response does not precede the other’s
invocation. This chapter makes a distinction between concurrency and contention. An object experiences
contention if distinct clients submit concurrent requests to the object or interleave requests to it
(even if those requests are not concurrent). For example, an object experiences frequent contention
if two clients alternate writing to it. Low contention can be characterized by long contention-free
runs on an object, comprising multiple operations on the object by a single client.
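
To make the notion of a contention-free run concrete, the following sketch splits a trace of operations into per-object runs. The trace format, a list of (client, object) pairs in invocation order, and the function name `contention_free_runs` are assumptions made for illustration only.

```python
from itertools import groupby


def contention_free_runs(trace):
    """Split a trace into contention-free runs per object.

    trace: list of (client, obj) pairs in invocation order (assumed format).
    Returns a dict mapping each object to a list of (client, run_length).
    """
    clients_per_object = {}
    for client, obj in trace:
        clients_per_object.setdefault(obj, []).append(client)
    # A run ends whenever a different client touches the same object.
    return {
        obj: [(client, len(list(group))) for client, group in groupby(clients)]
        for obj, clients in clients_per_object.items()
    }


trace = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x"), ("b", "x"), ("a", "x")]
print(contention_free_runs(trace))
```

In this trace, client a's first three operations on x form one run of length three, even though client b operates on y in between, because runs are counted per object. The run ends when b touches x.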

It is precisely such contention-free runs on objects for which Byzantine Locking is beneficial,
since it provides exclusive access to those objects and enables an optimized protocol to be used to
invoke operations on them. As such, it is important for performance that objects be defined so as to
minimize contention for them. HQ [32] and Q/U [2] divide the state machine into objects for similar
reasons. CBASE [57] divides the object space to exploit request parallelism within a single state
machine. Of course, the granularity depends on the application and the workload. In the extreme, if
only one client is active much of the time, the state need not be partitioned at all.

Because Byzantine Locking ensures good performance in fault- and contention-free runs, the
replicated state machine protocol design can focus on other goals, such as efficiently handling
faults [30] or contention [24, 55], or even on simplifying the protocol. Many replication protocols
elect a server as a leader, calling it the primary [24, 55], coordinator [72], or sequencer [95]. For
simplicity and concreteness, this chapter assumes Byzantine Locking on top of Zyzzyva, so certain
activities can be relegated to the primary to simplify the protocol. Take note, however, that
Byzantine Locking is not dependent on a primary-based protocol, but can build on a variety of underlying
replicated state machine protocols.


5.3  Byzantine  Locking  and  Zzyzx 

This section describes Byzantine Locking and Zzyzx at a high level. A more formal treatment of
Byzantine Locking, including a proof of correctness and liveness, is provided in Appendix A.

Byzantine Locking provides a client an efficient mechanism to modify replicated objects by
providing the client temporary exclusive access to the object. A client that holds temporary exclusive
access to an object is said to have locked the object. Zzyzx implements Byzantine Locking on top
of Zyzzyva [55], as illustrated in Figure 5.3.3. In Zzyzx, objects are unlocked by default. At first,
each client sends all operations through the Zyzzyva interface (Figure 5.3.3A). Upon realizing that
there is little contention, the client sends a request through Zyzzyva to lock a set of objects. The
Zyzzyva interface and the locking operation are described in Section 5.3.1.

Figure 5.3.3: Zzyzx components. The execution of Zzyzx can be divided into three subprotocols,
described in Section 5.3. A) If a client has not locked the objects needed for an operation, the client
uses a substrate protocol such as Zyzzyva (Section 5.3.1). B) If a client holds locks for all objects
touched by an operation, the client uses the log interface (Section 5.3.2). C) If a client tries to
access an object for which another client holds a lock, the unlock subprotocol is run (Section 5.3.3).

For subsequent operations that touch only locked objects, the client uses the log interface
(Figure 5.3.3B). The excellent performance of Zzyzx derives from the simplicity of the log interface,
which is little more than a replicated append-only log. To issue a request, a client increments a
sequence number and sends the request to 3f + 1 log servers, which may or may not be physically
co-located with the Zyzzyva replicas. Each log server appends the operation to its per-client request
log if the operation is in order, and then executes the operation on its local state before returning a
response to the client. If 2f + 1 log servers provide matching responses, the operation is complete.
The  log  interface  is  described  further  in  Section  5.3.2. 

If  another  client  attempts  to  access  a  locked  object  through  the  Zyzzyva  interface,  the  primary 
initiates  the  unlock  subprotocol  (Figure  5.3.3C).  The  primary  sends  a  message  to  each  log  server  to 
unlock  the  object.  The  log  servers  reach  agreement  on  their  state  using  the  Zyzzyva  interface,  mark 
the  object  as  unlocked,  and  copy  the  updated  object  back  into  the  Zyzzyva  replicas.  If  the  client 
that  locked  the  object  subsequently  attempts  to  access  the  object  through  the  log  interface,  the  log 
server  replies  with  an  error  code,  and  the  client  retries  its  request  through  the  Zyzzyva  interface. 
The unlock subprotocol is described further in Section 5.3.3.


5.3.1 The Zyzzyva Interface and Locking

In  Zzyzx,  each  client  maintains  a  list  of  locked  objects  that  is  the  client’s  current  best  guess  as  to 
which  objects  it  has  locked.  The  list  may  be  inaccurate  without  impacting  correctness.  Each  replica, 
including  the  primary,  maintains  a  special  state  machine  object  called  the  lock  table.  The  lock  table 
provides  an  authoritative  description  of  which  client,  if  any,  has  currently  locked  each  object.  The 
lock  table  also  provides  some  per-client  state,  including  a  variable  vs.  Each  object  is  initially  marked 
unlocked  in  the  lock  table,  vs  is  set  to  1  for  each  client,  and  each  client’s  list  of  its  locked  objects  is 
empty. 

Upon  invoking  an  operation  in  Zzyzx,  a  client  checks  if  any  object  touched  by  the  operation  is 
not  in  its  list  of  locked  objects,  in  which  case  the  client  uses  the  Zyzzyva  interface.  As  in  Zyzzyva, 
the  client  sends  its  request  to  the  primary  replica.  The  primary  checks  if  any  object  touched  by  the 
request  is  locked.  If  not,  the  primary  resumes  the  standard  Zyzzyva  protocol,  batching  requests  as 
needed  and  sending  ordering  messages  to  the  other  replicas  as  usual. 

If an object touched by the request is locked, the primary initiates the unlock subprotocol,
described in Section 5.3.3. The request is enqueued until all touched objects are unlocked. Any
subsequent request to lock an object touched by an enqueued request is enqueued as well (or,
alternatively, denied). As objects are unlocked, the primary dequeues each enqueued request for which
all  objects  touched  by  the  request  have  been  unlocked,  and  resumes  the  standard  Zyzzyva  protocol 
as  above. 
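
The primary's handling of requests that touch locked objects can be sketched as follows. This is a simplified, single-process model under stated assumptions: the class `Primary`, its method names, and the `ordered` list (a stand-in for handing a request to the standard Zyzzyva protocol) are all hypothetical, and lock requests that conflict with enqueued requests are denied rather than enqueued, which is the alternative mentioned above.

```python
from collections import deque


class Primary:
    """Toy model of the primary's lock check and request queue (illustrative)."""

    def __init__(self):
        self.lock_table = {}      # object -> client that holds the lock
        self.queue = deque()      # (request_id, objects) waiting for unlocks
        self.unlocking = set()    # objects for which unlock was initiated
        self.ordered = []         # stand-in for the standard Zyzzyva protocol

    def on_request(self, request_id, objects):
        locked = {o for o in objects if o in self.lock_table}
        if not locked:
            self.ordered.append(request_id)
            return "ordered"
        # A real primary would start the unlock subprotocol here.
        self.unlocking |= locked
        self.queue.append((request_id, set(objects)))
        return "enqueued"

    def on_lock_request(self, client, obj):
        # Simplification: deny if the object is locked or awaited by a
        # queued request; otherwise grant immediately.
        pending = any(obj in objects for _, objects in self.queue)
        if pending or obj in self.lock_table:
            return "denied"
        self.lock_table[obj] = client
        return "granted"

    def on_unlocked(self, obj):
        self.lock_table.pop(obj, None)
        self.unlocking.discard(obj)
        still_waiting = deque()
        for request_id, objects in self.queue:
            if any(o in self.lock_table for o in objects):
                still_waiting.append((request_id, objects))
            else:
                self.ordered.append(request_id)
        self.queue = still_waiting
```

In the actual protocol the grant decision is made by the replicas' deterministic locking policy; the immediate grant here only keeps the sketch short.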

Note  that  a  client  can  participate  in  Zzyzx  using  only  the  Zyzzyva  protocol,  and  in  fact  does  not 
need  to  be  aware  of  the  locking  mechanism  at  all.  In  general,  a  replicated  state  machine  protocol 
can be upgraded to support Byzantine Locking without affecting legacy clients.

A  client  can  attempt  to  lock  its  working  set  to  improve  its  performance.  To  do  so,  the  client  sends 
a  lock  request  for  each  object  using  the  Zyzzyva  protocol.  The  replicas  evaluate  a  deterministic 
locking  policy  to  determine  whether  to  grant  the  lock.  If  granted,  the  client  adds  the  object  to  its 
list  of  locked  objects.  The  replicas  also  return  the  value  of  the  per-client  vs  variable,  which  is 
incremented  upon  unlock.  The  variable  vs  stands  for  view-stamp,  but  this  chapter  will  refer  to  it 
as vs to prevent confusion with the view-stamp in PBFT and Zyzzyva. The vs variable is used to
synchronize  state  between  the  log  servers  and  Zyzzyva  replicas. 

If  there  is  little  concurrency  across  a  set  of  objects,  the  entire  set  can  be  locked  in  a  single 
operation. For example, each file in a file system could be represented by an object, and a client’s
entire  home  directory  subtree  could  be  locked  upon  login,  using  the  efficient  log  interface  for  nearly 
all  operations. 

The  Zzyzx  prototype  uses  a  simple  policy  to  decide  when  to  lock  an  object:  each  replica  counts 
how  often  a  single  client  accesses  an  object  without  contention.  The  client  sets  a  flag  in  its  request 
stating  that  it  would  like  to  lock  touched  objects.  If  the  count  reaches  a  threshold  and  the  flag  is 
set,  the  client  locks  the  object.  (The  evaluation  in  Section  5.5  uses  a  threshold  of  ten.)  Counters 
are  only  kept  for  recently  accessed  objects.  Every  few  thousand  operations,  each  counter  is  reset, 
which allows any client to lock the object. Resetting the counters makes locking multiple objects as
above  more  likely.  Thus,  a  client  will  only  lock  an  object  if  no  other  client  has  recently  accessed  the 
object.

Figure 5.3.4: Basic communication pattern of Zzyzx versus Zyzzyva. Operations on locked
objects in Zzyzx complete in a single round-trip with 2f + 1 log servers. Zyzzyva requires three
message delays, if all 3f + 1 replicas are responsive, or more message delays, if some replicas are
unresponsive.
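
The prototype's locking policy from Section 5.3.1 can be sketched as a per-object counter of uncontended accesses. The class `LockPolicy` and its parameter names are hypothetical; the default threshold of ten follows the evaluation setting, while the reset interval of 4096 is only an assumed stand-in for "every few thousand operations".

```python
class LockPolicy:
    """Counter-based locking policy (illustrative sketch).

    Deterministic as long as every replica sees the same ordered accesses.
    """

    def __init__(self, threshold=10, reset_interval=4096):
        self.threshold = threshold
        self.reset_interval = reset_interval  # assumed value
        self.counters = {}                    # object -> (client, count)
        self.ops_seen = 0

    def on_access(self, client, obj, wants_lock):
        """Record an access; return True if the lock should be granted."""
        self.ops_seen += 1
        if self.ops_seen % self.reset_interval == 0:
            self.counters.clear()             # periodic reset of all counters
        last_client, count = self.counters.get(obj, (client, 0))
        # An access by a different client restarts the count.
        count = count + 1 if last_client == client else 1
        self.counters[obj] = (client, count)
        return wants_lock and count >= self.threshold


policy = LockPolicy(threshold=3)
print([policy.on_access("a", "x", True) for _ in range(3)])
```

The sketch does not evict counters for objects that have not been accessed recently, which the prototype does.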


5.3.2  The  Log  Interface 

Upon  invoking  an  operation  in  Zzyzx,  a  client  may  find  that  all  objects  touched  by  the  operation  are 
in  its  list  of  locked  objects,  in  which  case  the  client  uses  the  log  interface.  The  client  increments  its 
request  number,  which  is  a  local  counter  used  for  each  operation  issued  through  the  log  interface, 
and  builds  a  message  containing  the  request,  the  request  number,  and  the  vs.  It  then  computes  a 
MAC  of  the  message  for  each  log  server,  as  in  prior  protocols.  Unlike  prior  protocols,  the  client  then 
computes another MAC of the message and the first set of MACs for each log server. The inner MACs
are used in the unlock subprotocol. The client then sends the message along with the inner MACs
and the appropriate outer MAC to each log server. Thus, the outer MAC is a standard authenticator.
The layered set of MACs is called a layered authenticator.
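
A layered authenticator can be sketched with Python's standard `hmac` module. The choice of HMAC-SHA256, the byte-string message encoding, and the concatenation of the message with the inner MACs are assumptions for illustration; the text does not fix a MAC algorithm or wire format. All function names are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, data: bytes) -> bytes:
    # HMAC-SHA256 is an assumed choice of MAC algorithm.
    return hmac.new(key, data, hashlib.sha256).digest()


def layered_authenticator(keys, message: bytes):
    """Build inner and outer MACs for one request.

    keys: list of keys, one shared between the client and each log server.
    Returns (inner, outer), each a list with one MAC per log server.
    """
    inner = [mac(k, message) for k in keys]
    covered = message + b"".join(inner)      # outer MACs cover the inner set
    outer = [mac(k, covered) for k in keys]
    return inner, outer


def verify_outer(key: bytes, message: bytes, inner, outer_mac: bytes) -> bool:
    """Check performed by a log server when it receives the request."""
    expected = mac(key, message + b"".join(inner))
    return hmac.compare_digest(expected, outer_mac)


def verify_inner(key: bytes, index: int, message: bytes, inner) -> bool:
    """Check of a server's own inner MAC, as used during unlock."""
    return hmac.compare_digest(mac(key, message), inner[index])


keys = [b"key-0", b"key-1", b"key-2", b"key-3"]   # 3f + 1 keys for f = 1
message = b"request|request_number=1|vs=1"
inner, outer = layered_authenticator(keys, message)
print(verify_outer(keys[2], message, inner, outer[2]))
```

Because the outer MAC covers the whole inner set, a log server that accepts the request has also committed to the inner MACs it stores in its request log, which other log servers can later validate individually.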

Upon receiving a request, each log server verifies the outer MAC. The log server then verifies
that the request is in order as follows: If the request number is lower than the most recent request
number for the client, the request is a duplicate and is ignored. If the request number matches
the most recent number, the most recent response is re-sent. If the request number is greater than
the next in sequence, or if the vs value is greater than the log server’s value, the log server must
have missed a request, so it initiates state transfer (described in Section 5.4.1). If the log server has
promised not to access an object touched by the request (since the object is in the process of being
unlocked, as described in Section 5.3.3), it returns failure.

If the request number is next in sequence, the log server tries to execute the request. It lazily
fetches objects from replicas as needed by invoking the Zyzzyva interface. Of course, if a log server
is co-located with a Zyzzyva replica, pointers to objects may be sufficient. If fetching an object fails
because the object is no longer locked by the client, the log server returns failure. Otherwise, the
log server has a local copy of each object that is touched by the request. It executes the request on
its local copy, appends the request, vs, request number, and the inner set of MACs to its request log,
and returns a response. Upon receiving 2f + 1 non-failure responses, the client returns the majority
response.
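
The ordering checks performed by a log server can be sketched for a single client as follows. The class `LogServer`, the status strings, and the toy state machine (each request is an (object, delta) pair that adds to a counter) are assumptions for illustration. Outer MAC verification, lazy fetching of objects, and state transfer itself are omitted.

```python
class LogServer:
    """Per-client request handling at one log server (illustrative sketch)."""

    def __init__(self, vs=1):
        self.vs = vs
        self.last_request_number = 0
        self.last_response = None
        self.request_log = []
        self.promised = set()   # objects currently being unlocked
        self.state = {}         # local copies of locked objects

    def handle(self, request_number, vs, request, inner_macs):
        obj, delta = request    # toy request: add delta to a counter object
        if request_number < self.last_request_number:
            return ("ignored", None)                  # duplicate
        if request_number == self.last_request_number:
            return ("resent", self.last_response)     # re-send last response
        if request_number > self.last_request_number + 1 or vs > self.vs:
            return ("state_transfer", None)           # missed a request
        if obj in self.promised:
            return ("failure", None)                  # object being unlocked
        # Next in sequence: execute on local state, then log the request.
        self.state[obj] = self.state.get(obj, 0) + delta
        response = self.state[obj]
        self.request_log.append((request, vs, request_number, inner_macs))
        self.last_request_number = request_number
        self.last_response = response
        return ("ok", response)


server = LogServer()
print(server.handle(1, 1, ("x", 5), ()))
print(server.handle(1, 1, ("x", 5), ()))
```

The first call executes the request and the second, carrying the same request number, only re-sends the stored response without executing it again.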

If some log server returns failure, the client sends a special retry request through the Zyzzyva
interface, which includes both the request and the request number. Each replica checks if the request
completed at the log servers before the last execution of the unlock subprotocol, in which case the
replicas tell the client to wait for a response from a log server. Otherwise, the replicas execute the
request as a normal Zyzzyva request.

Figure 5.3.5: Unlock message diagram. A) In the absence of faults and concurrency, the fast
unlock subprotocol is executed (Section 5.3.3). The primary fetches a hash of the request log at
2f + 1 log servers (labeled “Try Unlock”). If hashes match, the primary sends the hash values,
which unlock the object, and the conflicting request through Zyzzyva in a batch (“Issue request”).
B) Otherwise, the full unlock subprotocol is executed (Section 5.3.3). The primary fetches request
logs from 2f + 1 log servers (“Break lock”). It then asks each log server to validate the inner client
MACs in the request logs (“Validate logs”). Log servers agree on the longest valid request log using
Zyzzyva (“Unlock”), and they replay that request log to reach a consistent state. Finally, as above,
the log servers send the primary matching hashes, which the primary sends with the conflicting
request through Zyzzyva (“Issue request”).


Figure  5.3.4  shows  the  basic  communication  pattern  of  the  log  interface  in  Zzyzx  versus 
Zyzzyva. Zyzzyva requires 50% more network hops than Zzyzx, and Zyzzyva requires all 3f + 1
servers to be responsive to perform well, f more than the 2f + 1 responsive servers that Zzyzx
requires. Zzyzx improves upon Zyzzyva further, though, by removing the bottleneck primary and
requiring less cryptography at servers. The latter improvement obviates the need for batching,
improving latency without cost to throughput. Batching is a technique used in previous protocols [21, 55]
where the primary accumulates a batch of requests before sending them to other replicas.
Batching allows the cryptographic overhead of the agreement subprotocol to be amortized over many
requests, but waiting for a batch of requests before execution can increase latency. Because
Byzantine Locking provides clients temporary exclusive access to objects, each client can order its own
requests  for  locked  objects,  avoiding  the  need  for  an  agreement  or  even  a  speculative  agreement  [55] 
subprotocol. 


5.3.3  Handling  Contention 

The protocol, as described so far, is a simple combination of operations issued to Zyzzyva
(Section 5.3.1) and requests appended to a log (Section 5.3.2). The magic of Byzantine Locking is found
in the unlock subprotocol, which differentiates Byzantine Locking from prior lease- and lock-like
mechanisms found in systems such as Farsite [4] and Chubby [16].

A  client  that  does  not  hold  a  lock  on  an  object  uses  the  Zyzzyva  interface,  as  described  in 
Section  5.3.1.  Similarly,  a  client  that  receives  a  failure  response  from  a  log  server  retries  that 
request through the Zyzzyva interface, as described in Section 5.3.2. Either way, the client sends its
request to the primary, which checks if any objects touched by the request are locked, as described
in  Section  5.3.1.  If  an  object  is  locked,  the  primary  initiates  the  unlock  subprotocol,  described 
in  this  section.  Though  this  section  describes  unlocking  a  single  object,  implementations  can,  of 
course,  unlock  multiple  objects  in  a  single  execution  of  the  unlock  subprotocol.  The  primary  is 
well-positioned  to  initiate  the  unlock  subprotocol,  because  it  knows  which  objects  are  locked  and  it 
controls  the  order  in  which  operations  are  issued. 

The unlock subprotocol consists of a fast path and a slower path, described below in
Sections 5.3.3 and 5.3.3, respectively. Figure 5.3.5 shows the communication pattern for both types
of unlock. As shown in Figure 5.3.5A, the fast unlock is quite efficient, requiring just a single
round-trip between the primary and 2f + 1 log servers. The full unlock protocol, shown in
Figure 5.3.5B, requires additional communication, but is required only when a client or log server is
faulty,  or  when  request  logs  do  not  match. 


Fast  Unlock 

In  the  fast  unlock  path,  shown  in  Figure  5.3.5A,  the  primary  sends  a  “Try  unlock”  message  to  each 
log  server,  describing  the  object  (or  set  of  objects)  being  unlocked.  Each  log  server  constructs  a 
message  containing  the  hash  of  its  request  log  and  the  hash  of  the  current  value  of  the  object.  A 
designated  replier  includes  the  value  of  the  object  in  its  message  (as  in  replies  for  PBFT  [24]).  Once 
again,  if  log  servers  are  co-located  with  Zyzzyva  replicas,  only  a  pointer  to  the  object  may  need  to 
be  sent.  Each  log  server  sends  its  response  to  the  primary  formatted  as  a  Zyzzyva  request.  Unlike 
other  Zyzzyva  requests,  however,  the  log  server  does  not  wait  for  a  response,  but  rather  the  primary 
resends  its  “Try  unlock”  message  until  enough  log  servers  provide  responses. 

Upon receiving 2f + 1 responses with matching object and request log hashes and at least one
object that matches the hashes, the primary sends the responses through the standard Zyzzyva
protocol, batched with any requests enqueued due to object conflicts (see Section 5.3.1). Before sending
a  response  to  the  primary,  each  log  server  adds  the  object  to  a  list  of  objects  it  promises  not  to  touch 
until  the  next  instantiation  of  the  full  unlock  subprotocol,  described  in  Section  5.3.3  below. 
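
The primary's fast-path check can be sketched as grouping "Try unlock" responses by their pair of hashes. The response tuple layout, the use of SHA-256, and the function names are assumptions for illustration; only the designated replier is expected to carry the object value, so other responses hold `None` in that field.

```python
import hashlib
from collections import defaultdict


def digest(data: bytes) -> str:
    # SHA-256 is an assumed choice of hash function.
    return hashlib.sha256(data).hexdigest()


def fast_unlock_value(responses, f):
    """Return the agreed object value if the fast path can proceed, else None.

    responses: list of (server_id, log_hash, object_hash, value_or_None).
    """
    groups = defaultdict(set)
    for server_id, log_hash, object_hash, _ in responses:
        groups[(log_hash, object_hash)].add(server_id)
    for (log_hash, object_hash), servers in groups.items():
        if len(servers) < 2 * f + 1:
            continue                       # not enough matching responses
        for _, _, _, value in responses:
            # At least one supplied value must match the agreed object hash.
            if value is not None and digest(value) == object_hash:
                return value
    return None


value = b"object contents"
log_hash, object_hash = digest(b"request log"), digest(value)
responses = [
    (0, log_hash, object_hash, value),     # designated replier
    (1, log_hash, object_hash, None),
    (2, log_hash, object_hash, None),
]
print(fast_unlock_value(responses, f=1) == value)
```

If the function returns `None`, the primary either keeps resending "Try unlock" until more responses arrive or, when the logs genuinely differ, falls back to the full unlock subprotocol.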

Making  matching  logs  more  likely:  The  fast  path  requires  that  request  logs  match,  which 
may  not  be  the  case  if  the  messages  for  a  request  from  the  client  and  the  messages  for  a  “Try 
unlock”  from  the  primary  arrive  in  a  different  order  at  different  log  servers,  such  that  the  primary’s 
message precedes the client’s request on some log servers but vice-versa on others. Fortunately,
concurrent requests are less common than in other protocols, because only two nodes are involved
(a single client and the primary). The window of vulnerability for interleaving is the jitter between
client requests divided by the round-trip time (the client has at most one outstanding request). This
window is similar to the one in Q/U [2], and better than the one in HQ [32] (HQ’s window spans
two write phases). Furthermore, depending on network topology, multicast may enforce a de facto
ordering, allowing concurrency only if a packet is dropped between the last switch and the log
servers.

To  further  increase  the  likelihood  that  request  logs  match,  each  log  server  can  send  a  hash  for  the 
request  log  up  to  the  most  recent  request  that  touched  the  object  (or  set  of  objects)  being  unlocked. 
Thus,  a  client  request  concurrent  with  an  unlock  operation  only  forces  a  full  unlock  if  an  object  is 
needed by both the request and the unlock.


The Full Unlock Subprotocol

If  the  request  logs  do  not  match  in  “Try  unlock”  from  the  fast  path,  the  full  unlock  subprotocol 
(Figure 5.3.5B) must be initiated. The primary fetches signed request logs from 2f + 1 log servers.
(Signatures can be avoided using standard techniques, but full unlock is rare, so signatures are used
to simplify the protocol description.) Before sending its request log, a log server adds the object
(or set of objects) being unlocked to its list of objects that it promises not to touch until the next
full unlock, as in a fast unlock. The primary then sends these request logs to each log server, which
validates its MAC from the inner layer of MACs stored with each request in the request log (see
Section 5.3.2). Log validation can be skipped if the longest request log matches at f + 1 log servers.

The log servers then vote on which request logs contain valid MACs by issuing a Zyzzyva
request. The longest request log for which f + 1 log servers found valid MACs is returned to each log
server, which then replays the log as needed to reach a consistent state that matches the state at other
correct log servers. The log server then marks the object being unlocked as unlocked, increments
vs, and clears the list of objects it promised not to touch until the next full unlock. Finally, as in
fast unlock in Section 5.3.3, correct log servers send the primary matching hash values describing
their state and the object to be unlocked. The primary sends these hash values, along with any
requests enqueued due to object conflicts (see Section 5.3.1), in a batch through the standard Zyzzyva
protocol.
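
The selection rule in the vote above (the longest request log that f + 1 log servers found valid) can be sketched as follows. This is an illustrative sketch under stated assumptions: logs are modeled as tuples of request identifiers, `votes` maps a hypothetical server identifier to the logs that server validated, and the function name `choose_log` is not from the protocol description.

```python
def choose_log(votes, f):
    """Return the longest log vouched for by at least f + 1 servers, else None.

    votes: dict mapping server_id -> iterable of logs (tuples of request ids)
           whose MACs that server found valid (hypothetical representation).
    """
    counts = {}
    for valid_logs in votes.values():
        # Count each server at most once per log.
        for log in set(valid_logs):
            counts[log] = counts.get(log, 0) + 1
    eligible = [log for log, count in counts.items() if count >= f + 1]
    return max(eligible, key=len) if eligible else None
```

Requiring f + 1 votes means at least one correct log server vouched for the chosen log. The sketch does not model how ties between distinct logs of equal length would be resolved.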

Figure  5.3.5B  illustrates  a  few  optimizations  over  the  complete  pseudo-code  for  the  full  unlock 
subprotocol.  In  particular,  several  sequential  Zyzzyva  requests  can  be  issued  in  parallel,  as  shown 
in  the  “Unlock”  phase  in  Figure  5.3.5B.  The  full  unlock  subprotocol  is  similar  to  view  change  in 
PBFT  or  Zyzzyva,  and  is  not  needed  in  fault-  and  concurrency-free  executions. 


5.4  Protocol  Details 

The  log  servers  use  checkpointing  and  state  transfer  mechanisms,  similar  to  mechanisms  found  in 
PBFT  [24],  HQ  [32],  and  Zyzzyva  [55],  described  in  Section  5.4.1.  As  in  Q/U  [2]  and  HQ  [32], 
Zzyzx  takes  advantage  of  preferred  quorums.  Section  5.4.2  describes  optimizations  for  read-only 
requests,  more  aggressive  locking,  lower  contention,  and  preferred  quorums.  Zzyzx  can  provide 
near-linear  scalability  by  deploying  additional  replicas,  discussed  in  Section  5.4.3.  Such  scalability 
is unprecedented: the throughput of most Byzantine fault-tolerant protocols cannot be increased
by adding replicas, because all requests flow through a bottleneck node (e.g., the primary
in PBFT [24] and Zyzzyva [55]) or overlapping quorums (which provide limited scalability).
Appendix  B  provides  further  details. 

Though this paper assumes that at most f servers (log servers or replicas) fail, Byzantine Locking
(and many other Byzantine fault-tolerant protocols) can support a hybrid failure model that
allows for different classes of failures. As in Q/U [2], suppose that at least n - t servers are correct,
and that at least n - b are honest, i.e., either correct or fail only by crashing; as such, t ≥ b. Then,
the total number of servers is b + 2t + 1 rather than 3f + 1, and the quorum size is b + t + 1 rather
than 2f + 1. Of course, when f = b = t, it is the case that b + 2t + 1 = 3f + 1 and b + t + 1 = 2f + 1.
The benefit of such a hybrid model is that one additional server can provide the benefits of Byzantine
fault-tolerance. (Or, more generally, b additional servers can tolerate b simultaneous Byzantine
faults.) A hybrid model suits deployments where arbitrary faults, such as faults due to soft errors,
are less common than crash faults.
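
The server and quorum counts in the hybrid model can be checked with a few lines of arithmetic. The function name `hybrid_sizes` and the dictionary it returns are illustrative assumptions; the formulas are the ones stated in the paragraph above.

```python
def hybrid_sizes(t, b):
    """Server count and quorum size when tolerating t total faults, b of them Byzantine."""
    if not (t >= b >= 0):
        raise ValueError("the hybrid model assumes t >= b >= 0")
    return {"servers": b + 2 * t + 1, "quorum": b + t + 1}
```

For instance, `hybrid_sizes(1, 1)` gives the familiar 3f + 1 = 4 servers and 2f + 1 = 3 quorum for f = 1, while `hybrid_sizes(1, 0)` gives the crash-only figures of 3 servers and a quorum of 2, so tolerating one Byzantine fault costs one additional server.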

5.4.1  Checkpointing  and  State  Transfer 

Log  servers  should  checkpoint  their  state  periodically  both  to  truncate  their  request  logs  and  to  limit 
the  amount  of  work  needed  for  a  full  unlock.  The  full  unlock  operation  acts  as  a  checkpointing 
mechanism,  because  log  servers  reach  an  agreed-upon  state.  Thus,  upon  full  unlock,  requests  prior 
to  the  unlock  can  be  purged.  The  simplest  checkpoint  protocol  is  for  log  servers  to  execute  the  full 
unlock  subprotocol  for  a  null  object  at  fixed  intervals.  Zzyzx  can  also  use  standard  checkpointing 
techniques found in Zyzzyva [56] and similar protocols, which may be more efficient.*

If a correct client sends a request number greater than the next request number in order, the log
server must have missed a request. The log server sends a message to all 3f other log servers, asking
for missed requests. Upon receiving matching values for the missing requests and the associated
MACs from f + 1 log servers, the log server replays the missed requests on its local state to catch
up. Since 2f + 1 log servers must have responded to each of the client’s previous requests, at least
f + 1 correct log servers must have these requests in their request logs. A log server may substitute
a stable checkpoint in place of prior requests.
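
The catch-up rule above (accept a missed request once f + 1 log servers report matching values) can be sketched as follows. The reply representation and the name `missed_request` are assumptions for illustration; MAC verification and the substitution of stable checkpoints are omitted.

```python
from collections import Counter


def missed_request(replies, f):
    """Return the value reported by at least f + 1 log servers, else None.

    replies: list of hashable values (e.g. request bytes plus MACs), one per
             responding log server, all for the same missing request number.
    """
    counts = Counter(replies)
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    # f + 1 matching replies guarantee that at least one came from a correct server.
    return value if count >= f + 1 else None
```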

5.4.2  Optimizations 

Read-only requests: A client can read objects locked by another client if all 3f + 1 log servers
return the same value, as in Zyzzyva. Thus, read-only operations perform no worse. If 2f + 1 log
servers return the same value and the object was not modified since the last checkpoint, the client
can return that value. If the object was modified, the client can force a checkpoint, which may be
less expensive than the unlock subprotocol.
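
The client-side decision for read-only requests can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `read_result`, the flat list of replies, and the boolean `modified_since_checkpoint` flag are hypothetical, and a `None` result stands for "fall back to forcing a checkpoint or running the unlock subprotocol".

```python
from collections import Counter


def read_result(replies, f, modified_since_checkpoint):
    """Decide whether a read of an object locked by another client can return.

    replies: hashable values returned by the log servers (hypothetical form).
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    if count == 3 * f + 1:
        return value  # all 3f + 1 log servers agree
    if count >= 2 * f + 1 and not modified_since_checkpoint:
        return value  # 2f + 1 agree and the object is unchanged since the checkpoint
    return None  # caller forces a checkpoint or unlocks the object
```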

Aggressive  locking:  If  an  object  is  locked  but  never  fetched  by  log  servers  through  the  Zyzzyva 
interface,  there  is  no  need  to  run  the  unlock  subprotocol  under  contention.  The  primary  can  just 
send  the  conflicting  request  through  the  standard  Zyzzyva  protocol,  which  will  deny  future  fetch 
requests  pertaining  to  the  previous  lock.  Thus,  aggressively  locking  a  large  set  of  objects  does  not 
lower  performance. 

Pre-serialization:  Section  5.5.7  finds  that  Zzyzx  outperforms  Zyzzyva  for  contention-free  runs 
as short as ten operations. The pre-serializer technique of Singh et al. [95] could make the break-even
point even lower.

* Such protocols operate as follows. A log server computes a tentative checkpoint at fixed request intervals, which
consists of the log server’s current state. Each log server sends a MACed hash of its tentative checkpoint to every other
log server. Upon accumulating 2f + 1 such messages that match, a log server sends a checkpoint message to every other
log server, which includes the hash from the tentative checkpoint. Upon accumulating 2f + 1 checkpoint messages, the
checkpoint is stable and prior state can be purged.

Preferred quorums: Rather than send requests to all 3f + 1 log servers for every operation, a
client can send requests to 2f + 1 log servers if all 2f + 1 servers provide matching responses. This
optimization limits the amount of data sent over the network, which is useful when the network is
bandwidth- or packet-limited, or when the remaining f replicas are slow. It also frees f servers to
process other tasks or operations in the common case, thus allowing a factor of up to (3f + 1)/(2f + 1) higher
throughput.
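
The preferred quorum optimization can be sketched from the client's point of view as follows. This is a simplified sketch, not the measured implementation: `send` is a hypothetical transport callback that returns a reply or `None` when a server does not answer, timeouts and MACs are not modeled, and the fallback simply re-sends the request to every log server.

```python
from collections import Counter


def invoke(request, servers, f, send):
    """Try the preferred quorum first; fall back to all log servers on mismatch.

    servers: list of 3f + 1 log servers, preferred quorum first (assumption).
    send: callable(server, request) -> hashable reply, or None if unresponsive.
    """
    quorum = 2 * f + 1
    replies = [send(server, request) for server in servers[:quorum]]
    if replies[0] is not None and all(reply == replies[0] for reply in replies):
        return replies[0]
    # A preferred server was slow or disagreed: contact every log server.
    replies = [send(server, request) for server in servers]
    answered = Counter(reply for reply in replies if reply is not None)
    if answered:
        value, count = answered.most_common(1)[0]
        if count >= quorum:
            return value
    return None
```

In the common case only 2f + 1 servers see each request, which is where the bandwidth saving and the spare capacity on the remaining f servers come from.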

Performance under attack: Byzantine fault-tolerant prototypes often perform poorly, if at all,
in malicious environments [30]. The protocol description in Section 5.3 does not address performance
under attack in the interest of clarity and simplicity, but standard techniques to detect, isolate,
and mitigate faulty behavior can be applied (e.g., [30, 46], [21, Section 3.2.2]).

5.4.3  Scalability  Through  Log  Server  Groups 

There  is  nothing  in  Section  5.3  that  requires  the  group  of  replicas  used  in  the  Byzantine  fault-tolerant 
replicated  state  machine  protocol  to  be  hosted  on  the  same  servers  as  the  log  servers.  Thus,  a  system 
can  deploy  replicas  and  log  servers  on  distinct  servers.  Similarly,  the  protocol  can  use  multiple 
distinct  groups  of  log  servers.  An  operation  that  spans  multiple  log  server  groups  can  always  be 
completed  through  the  Zyzzyva  interface.  The  benefit  of  multiple  log  server  groups  is  near  linear 
scalability  in  the  number  of  servers,  which  far  exceeds  the  scalability  that  can  be  achieved  by  adding 
servers  in  prior  protocols. 

For example, given 4f + 2 replicas and assuming cross-group operations are rare, the replicas
can be divided into two log server groups with non-overlapping preferred quorums of size 2f + 1.
Thus, the protocol can scale by roughly a factor of 2x given an additional f + 1 servers. The first
2f + 1 servers would comprise the preferred quorum of the first log server group, and the second
2f + 1 servers would comprise the preferred quorum of a second log server group. The state machine
replicas would be hosted on any subset of 3f + 1 servers.
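
The division into non-overlapping preferred quorums can be illustrated with a short sketch. The function name `preferred_quorums` and the simple in-order slicing are assumptions for the example; any assignment that keeps the quorums disjoint would serve equally well.

```python
def preferred_quorums(servers, f):
    """Split `servers` into disjoint preferred quorums of size 2f + 1.

    Servers left over after the last full quorum are not assigned.
    """
    size = 2 * f + 1
    return [servers[start:start + size]
            for start in range(0, len(servers) - size + 1, size)]
```

With f = 1 and 4f + 2 = 6 servers, this yields two disjoint quorums of three servers each, matching the example above.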

Because  most  operations  would  touch  only  half  of  the  servers,  the  achievable  system  throughput 
would  be  double  that  of  a  typical  implementation.  Furthermore,  this  technique  may  allow  log  servers 
in a wide-area distributed system to be located geographically closer to clients. For example, 2f + 1
log servers could be deployed in each of the California, Massachusetts, Texas, and Washington
offices of a company. Most operations would operate efficiently on local copies of the data in
request logs. Operations that spanned sites could be completed through the replicated state machine
protocol run over 3f + 1 replicas.

An application can benefit from multiple log server groups only if many operations execute
on disjoint subsets of objects and if objects do not need to be transferred often. Though not all
applications can benefit, several important applications in most need of Byzantine fault tolerance
meet these requirements. For example, many distributed storage and database applications are
embarrassingly parallelizable. In fact, Zzyzx was designed in the course of architecting large-scale,
high-performance Byzantine fault-tolerant storage systems [1, 48]. In such a system, Zzyzx would
manage metadata, and data would be accessed through a Byzantine fault-tolerant block storage
protocol (e.g., [18, 44, 48]).


5.5  Evaluation 

This  section  evaluates  the  performance  of  Zzyzx  and  compares  it  with  that  of  Zyzzyva  and  that  of 
an  unreplicated  server.  Zzyzx  is  implemented  as  a  module  on  top  of  Zyzzyva,  which  in  turn  is  a 
modified version of the PBFT library [24]. MD5 was replaced with SHA1 in Zyzzyva and Zzyzx,
because MD5 is no longer considered secure [104]. (Zyzzyva also uses AdHash, which is known to
be  insecure  [103].)  Zyzzyva  is  the  only  other  Byzantine  fault-tolerant  protocol  measured,  because 
it  outperforms  prior  Byzantine  fault-tolerant  protocols  [55,  96]. 

Because Zyzzyva does not utilize Byzantine Locking, replicas must agree on the order of
requests before a response is returned to the client (rather than the client ordering requests on locked
objects). Order agreement requires that the primary generates MACs for each of the 3f + 1 other
servers, which would prove expensive if done for every operation. Thus, the primary in Zyzzyva
(like in PBFT) orders multiple operations at once. The primary accumulates a batch of size B of
requests. It then orders all B operations and computes a single set of MACs, amortizing the
cryptographic cost over several requests. Though important for high throughput in Zyzzyva, batching
also increases latency. This section considers Zyzzyva using batch sizes of one (B=1) and
ten (B=10).
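
The amortization that batching provides can be made concrete with a small calculation. This is a back-of-the-envelope sketch that assumes, as the paragraph above states, one set of 3f + 1 MACs per ordered batch at the primary; the function name `primary_macs_per_request` is illustrative, and other cryptographic costs are ignored.

```python
def primary_macs_per_request(f, batch_size):
    """Average number of ordering MACs the primary computes per request."""
    if f < 0 or batch_size < 1:
        raise ValueError("require f >= 0 and batch_size >= 1")
    macs_per_batch = 3 * f + 1  # one MAC per server, as described above
    return macs_per_batch / batch_size
```

For f = 1, this gives 4 MACs per request with B=1 and 0.4 MACs per request with B=10, which is the cryptographic saving that batching trades against latency.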

5.5.1  Assumptions  and  Limitations 

Zzyzx always ensures correctness, but it was not designed to perform well in a malicious
environment. There are two reasons for this design choice. First, in many environments, corruptions are
relatively rare. Thus, performance only matters in the common case, when there are no corruptions.
Second, the design did not consider performance in malicious environments to avoid additional
complexity. Most Byzantine fault-tolerant replicated state machine protocols perform poorly in
malicious environments, but some recent proposals introduce techniques that Zzyzx could adopt to
ensure reasonable performance even in the presence of malicious attackers [7, 30].

Zzyzx requires that the state machine can be partitioned into objects. As discussed in
Section 5.1, many other protocols impose similar requirements, and many important systems are
amenable to partitioning into objects. Furthermore, the scalability of Zzyzx depends upon the
ability to partition the workload into log server groups. Fortunately, as discussed in Section 5.4.3, many
important workloads, such as distributed storage workloads, are easily partitionable.

5.5.2  Experimental  Setup 

All  experiments  are  performed  on  a  set  of  computers  that  each  have  a  3.0  GHz  Intel  Xeon  processor, 
2 gigabytes of memory, and an Intel PRO/1000 network card. All computers are connected to an HP
ProCurve Switch 2848, which has a specified internal bandwidth of 96 Gbps (69.3 Mpps). Each
computer runs Linux kernel 2.6.28-7 (the most recent at the time of writing) with default networking
parameters. Experiments use the Zyzzyva code released by the protocol’s authors [55], configured
to use all optimizations [55, 56]. Both Zyzzyva and Zzyzx use UDP multicast. After accounting for
the performance difference between SHA1 and MD5, the evaluation of Zyzzyva agrees with that of
Kotla et al. [55].

A Zyzzyva replica process runs on each of 3f + 1 server computers. For Zzyzx, except where
noted, one Zyzzyva replica process and one log server process run on each of 3f + 1 server
computers. Zzyzx is measured both with the preferred quorum optimization of Section 5.4.2 enabled
(labeled “Zzyzx”) and with preferred quorums disabled (labeled “Zzyzx-noPQ”).

The micro-benchmark workload used in Sections 5.5.3 through 5.5.5 consists of each client
process performing a null request and receiving a null reply. This workload is the same as the
workload measured by Kotla et al. [55], which is based on the 0/0 micro-benchmark of Castro and
Liskov [24]. Each client accesses an independent set of objects, avoiding contention. A client running
Zzyzx locks each object on first access. The workload is meant to highlight the overhead found
in each protocol, as well as to provide a basis for comparison by reproducing prior experiments.

Figure 5.5.6: Throughput vs. client processes when f = 1 and all servers are responsive.

Each  physical  client  computer  runs  10  instances  of  the  client  process.  This  number  was  chosen 
so that the client computer does not become processor-bound. All experiments are run for 90 seconds,
with measurements taken from the middle 60 seconds. The mean of at least 3 runs is reported,
and  the  standard  deviation  for  all  results  is  within  3%  of  the  mean. 
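
The reporting rule above (the mean of at least 3 runs, with the standard deviation within 3% of the mean) can be expressed as a short check. This sketch assumes the sample standard deviation, which the text does not specify, and the function name `summarize_runs` is illustrative.

```python
import statistics


def summarize_runs(throughputs, tolerance=0.03):
    """Return (mean, stdev, within_tolerance) for a list of per-run throughputs."""
    if len(throughputs) < 3:
        raise ValueError("at least 3 runs are required")
    mean = statistics.mean(throughputs)
    stdev = statistics.stdev(throughputs)  # sample standard deviation (assumption)
    return mean, stdev, stdev <= tolerance * mean
```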

5.5.3  Scalability 

Figure 5.0.1 on page 58 shows the throughput of Zzyzx and Zyzzyva as the number of servers
increases when tolerating one fault. The first 3f + 1 log servers are co-located with the Zyzzyva
replicas. Additional log servers run on dedicated computers. Unlike prior protocols, including
Zyzzyva, the performance of Zzyzx improves as more servers are utilized. Since only 2f + 1 log
servers are involved in each operation and independent log server sets do not need to overlap, the
increase in usable quorums results in nearly linear scalability. Zyzzyva and previous protocols do
not support scaling in this fashion. Although data is only shown for f = 1, the general shape of the
curve applies when tolerating more faults.

5.5.4  Throughput 

Figure 5.5.6 shows the throughput achieved, while varying the number of clients, when tolerating a
single fault and when all servers are correct and responsive. Throughput is not noted for B=10 when
using fewer than 10 clients. Zzyzx significantly outperforms all Zyzzyva configurations. Zzyzx’s
maximum throughput is 2.9x that of Zyzzyva with B=10, and higher still compared to Zyzzyva
without batching. When Zzyzx is run on f + 1 additional servers (6 total), its maximum throughput
is 3.9x that of Zyzzyva with B=10. Even without preferred quorums, Zzyzx’s maximum throughput
is 2.2x that of Zyzzyva with B=10, due to Zzyzx’s lower network and cryptographic overhead.

Figure 5.5.7: Latency vs. throughput when f = 1 and all servers are responsive.

Due  to  the  preferred  quorums  optimization,  Zzyzx  provides  higher  maximum  throughput  than 
the  unreplicated  server,  which  simply  generates  and  verifies  a  single  MAC,  because  each  log  server 
processes only a fraction, (2f + 1)/(3f + 1), of the requests. With preferred quorums disabled (Zzyzx-noPQ),
Zzyzx  provides  lower  throughput  than  the  unreplicated  server  due  to  checkpoint,  request  log,  and 
network  overheads. 


5.5.5  Latency 

Figure  5.5.7  shows  the  average  latency  for  a  single  operation  as  the  applied  load  is  varied.  When 
serving  a  single  request,  Zzyzx  exhibits  39-43%  lower  latency  than  Zyzzyva,  and  Zzyzx  continues 
to provide lower latency as load increases. The lower latency is because Zzyzx requires only 2 one-way
message delays (33% fewer than Zyzzyva), each server computes fewer MACs, and log servers
in  Zzyzx  never  wait  for  a  batch  of  requests  to  accumulate  before  executing  the  request  and  returning 
a  response. 

Figure 5.5.7 also illustrates the problem with batching in Zyzzyva. Unless replicas are saturated,
Zyzzyva provides lower latency when batching is disabled, but batching allows substantially
higher maximum throughput.^ Whereas batching forces a choice between low latency and high
throughput in Zyzzyva, Zzyzx can provide both low latency and high throughput.


^PBFT  uses  feedback  from  subsequent  protocol  phases  to  tune  the  batch  window,  avoiding  the  latency  increase  seen  in 
Zyzzyva.  Zyzzyva  eliminates  subsequent  protocol  phases  from  PBFT  and  so  cannot  use  this  technique.  But,  Zyzzyva  still 
exhibits  lower  latency  than  PBFT  because  requests  complete  after  fewer  message  delays  and  require  less  cryptography. 
Figure 5.5.8: Throughput vs. client processes when f = 1 and one server is unresponsive.

5.5.6 Performance with f Slow Servers

Figure  5.5.8  shows  the  throughput  achieved  while  varying  the  number  of  clients  when  configured  to 
tolerate  a  single  fault  and  when  a  single  server  is  unresponsive  though  not  malicious  (e.g.,  crashed). 
The  performance  of  an  unreplicated  server  is  included  for  comparison. 

Zzyzx throughput decreases approximately 33%, to the unaffected level of Zzyzx-noPQ,
because all requests now use the same 2f + 1 servers. Once clients detect that a server in a preferred
quorum of the log server group is unresponsive, they send requests to all servers in the log server
group and stay on the fast path. With or without preferred quorums, when one server is
unresponsive, Zzyzx provides 55% higher throughput than Zyzzyva.

Figure 5.5.8 for Zyzzyva with B=10 reports substantially better throughput than reported by
Kotla et al. at SOSP [55], because the released Zyzzyva code uses the “commit optimization”
described in their extended technical report [56]. The commit optimization, however, requires an extra
message delay in the form of an all-to-all round of communication. Due to this all-to-all
communication, Zyzzyva with the commit optimization may not be as “fault scalable” [2] as the Zyzzyva
protocol reported by Kotla et al. at SOSP [55].

5.5.7  Performance  Under  Contention 

Figure  5.5.9  shows  the  performance  of  Zzyzx  under  contention.  For  this  workload,  each  client 
accesses an object a fixed number of times before the object is unlocked. The client then procures a
new lock and resumes accessing the object. The experiment is designed to identify the break-even
point of Zzyzx, which is the length of the shortest contention-free run for which Zzyzx outperforms
Zyzzyva.

When batching in Zyzzyva is disabled to improve latency, Zzyzx outperforms Zyzzyva for
contention-free runs that average 10 or more operations. Zzyzx outperforms Zyzzyva when batching
is enabled (B=10) for contention-free runs that average 20 or more operations. Zzyzx achieves
85-90% of its maximum throughput for contention-free runs of 160 operations; as noted in
Section 5.1.1, contention-free runs often average in the thousands.

Figure 5.5.9: Throughput vs. number of consecutive contention-free ops when f = 1 and all
servers are responsive. The horizontal lines show the throughput of Zyzzyva and Zzyzx in the
absence of contention.

5.5.8  Postmark  and  Trace-driven  Execution 

Figure  5.5.10  compares  the  execution  of  Zzyzx  and  Zyzzyva  on  a  file  system  workload.  Zzyzx 
outperforms prior replicated state machines for both data and metadata due to its minimal
communication, but custom block transport protocols can use erasure coding to outperform replicated
state machines [44, 48]. Thus, this evaluation focuses on the difference in performance between
protocols when managing the metadata of a distributed file system.

To  test  a  metadata  workload,  a  memory-backed  file  system  using  FUSE  was  implemented 
that  interfaces  with  Zyzzyva,  Zzyzx,  and  an  unreplicated  server  for  metadata  operations.  Zzyzx 
completed 60% more transactions per second (TPS) than Zyzzyva in the default Postmark
benchmark [52], where a transaction may consist of multiple Zzyzx or Zyzzyva operations plus some
processing time. Zzyzx completed 22% fewer transactions per second than the unreplicated server.
Postmark  produces  a  workload  with  many  small  files,  similar  to  the  workload  found  in  a  mail  server. 
Postmark  performance  depends  primarily  upon  request  response  time. 

Metadata  operations  were  extracted  from  NFS  traces  of  a  large  departmental  server.  Over 
14 million metadata operations were considered from the Harvard EECS workload between Monday 17
and Friday 21 of February 2003 [38]. A matching workload mix was then executed on Zyzzyva
and  Zzyzx.  Zzyzx  used  the  lock  policy  described  in  Section  5.3.1  to  determine  when  to  lock  an 
object. 

The operations in the trace were 55% read-only, for which Zyzzyva used its one round-trip
read optimization. Because read-only operations in Zyzzyva avoid the primary, they perform
similarly to the Zzyzx-noPQ line. Zzyzx used the log interface for 82% of operations, with an
average contention-free run length of 4926 operations. Of the 18% of operations executed through
the Zyzzyva interface, 56% were read-only and used Zyzzyva’s one round-trip read optimization.
Overall, Zzyzx provided a factor of 1.6x higher throughput than Zyzzyva.

Figure 5.5.10: Performance of Zzyzx and Zyzzyva for a file system workload.

5.6  Conclusion 

Byzantine Locking allows creation of efficient and scalable Byzantine fault-tolerant services.
Compared to the state-of-the-art (Zyzzyva), Zzyzx delivers a factor of 2.2x-2.9x higher throughput
during concurrency-free and fault-free operation, given the minimum number of servers (3f + 1).
Moreover, unlike previous Byzantine fault-tolerant replicated state machine protocols, Zzyzx offers
near-linear scaling of throughput as additional servers are added.


Chapter  6 

Conclusion 


This  dissertation  makes  a  number  of  contributions  to  the  field  of  Byzantine  fault-tolerant  distributed 
system  design,  both  in  overall  protocol  design  and  in  techniques  that  should  be  useful  in  other  future 
protocols.  It  offers  the  following  three  conclusions  and  a  few  suggestions  for  future  work: 

Byzantine  fault-tolerant  erasure-coded  storage  systems  can  provide  similar  latency  and 
throughput  as  systems  that  tolerate  only  crashes:  This  dissertation  demonstrates  that  Byzantine 
fault-tolerant  erasure-coded  storage  systems  can  be  implemented  using  similar  hardware  resources 
as systems that tolerate only crashes (only m + 2f ≥ 3f + 1 servers) without introducing much
computational overhead beyond a checksum of the data or much network overhead beyond an additional
roundtrip  when  writing  large  blocks  of  data. 

Byzantine  fault-tolerant  replicated  state  machines  can  service  requests  in  a  single  roundtrip 
to t+b+1 responsive servers with minimal cryptographic overhead: Applications such as a
distributed metadata service require the richer semantics of a state machine. Prior Byzantine
fault-tolerant replicated state machine protocols either required that 3f + 1 or more nodes participate in
each request or provided lower throughput and higher latency. Zzyzx provides higher throughput
and lower latency than prior protocols, and only 2f + 1 servers need participate in a request in
the common case. More precisely, when tolerating b Byzantine faulty servers and t crash faults
(t ≥ b), Zzyzx requires t + b + 1 servers to participate in each request, out of 2t + b + 1 total servers.
Thus, when b = 1, Zzyzx can provide many of the benefits of Byzantine fault tolerance with just
a single additional server. To achieve this result, Zzyzx requires that the workload exhibit low
object contention. Fortunately, many important applications, such as metadata services, experience low
contention.

Byzantine  fault-tolerant  replicated  state  machines  can  scale  through  workload  partitioning: 

Block  storage  protocols  can  scale  by  distributing  blocks  of  data  across  a  larger  set  of  servers,  but 
more general services such as a metadata cluster face more challenging scalability. Many prior
protocols could not scale by partitioning the workload because requests were ordered through a single,
centralized server. Zzyzx provides scalable fault tolerance through Byzantine Locking, which
allows the workload of a distributed system to be partitioned across distinct sets of servers. If an operation
interacts with state partitioned across different servers, the state is aggregated such that the
operation remains atomic. Thus, in addition to providing higher throughput and lower latency than prior
Byzantine fault-tolerant replicated state machines, Zzyzx introduces unprecedented scalability.

Future work: This dissertation could be extended in a number of ways. Homomorphic fingerprinting
uses Reed-Solomon codes [85], Rabin’s Information Dispersal Algorithm [84], and other linear
erasure coding schemes. Some recent erasure coding schemes are more efficient (e.g., [81]), but
may  generate  fewer  fragments,  and  may  not  be  linear  and  hence  not  compatible  with  homomorphic 
fingerprinting.  A  potential  extension  could  consider  efficient  linear  erasure  coding  schemes  that 
generate  many  fragments. 

Byzantine  fault-tolerant  replicated  state  machine  protocols  are  perceived  as  complicated,  but 
most  share  common  features.  For  example,  clients  can  either  access  servers  directly,  or  clients  can 
send  messages  through  a  pre-serializer  [95]  or  primary.  The  direct  approach  improves  latency  but 
suffers under contention (e.g., Kursawe [63], Q/U [2], HQ [32], and Byzantine Locking). The
pre-serializer approach increases latency, but avoids contention (e.g., PBFT [24] and Zyzzyva [55]).
Singh  et  al.  considered  adding  a  pre-serializer  to  direct-access  protocols  and  found  similarities  to 
primary-based  protocols  [95].  A  potential  extension  could  continue  to  modularize  such  concepts. 


Appendix  A 


The  Correctness  of  Byzantine  Locking 


This  appendix  provides  a  proof  of  the  safety  and  liveness  properties  of  Byzantine  Locking  and 
Zzyzx.  Zzyzx  can  be  divided  into  two  types  of  objects,  the  manager  object  and  the  log  object. 
The  manager  object  manages  lock  state  and  processes  requests  on  objects  that  experience  high 
concurrency.  It  is  implemented  using  a  Byzantine  fault-tolerant  replicated  state  machine  protocol 
such  as  Zyzzyva  [55]  or  PBFT  [24].  There  is  a  log  object  for  each  client,  and  each  log  object 
processes  requests  for  objects  that  have  been  locked  by  its  client.  Zzyzx  achieves  better  performance 
through  an  optimized  implementation  of  the  log  object. 

Section A.1 provides specifications of object types. Section A.1.1 defines the object-based state
machine, which describes the type of applications that Zzyzx can efficiently implement. Zzyzx
can implement state machines similar to those considered by PBFT [24], Q/U [2], HQ [32], and
Zyzzyva [55]. As in some of these protocols, Zzyzx divides application state into objects to ensure
efficiency. Section A.1.2 and Section A.1.3 provide the sequential specification of the log object
and the manager object. Section A.2 describes how a manager object and a set of log objects can
be combined to form a linearizable Byzantine fault-tolerant replicated state machine. Section A.3
describes how a client would submit requests in such a system and demonstrates that such a system
is live.


A.1 Sequential Specifications of Relevant Objects


This section provides the sequential specification of the object-based state machine, the log object,
and the manager object. The term “object” is used in more than one context. In the distributed
systems community, a distributed protocol is used to implement an object, such as a log object,
manager object, or state machine object. But, in a different context, the implementation of a state
machine  is  often  broken  into  small  chunks  of  state  called  objects  [2,  24,  32,  55],  either  as  a  practical 
matter  or  for  protocol  reasons.  This  chapter  will  differentiate  between  the  two  by  always  referring 
to  the  log  object,  the  manager  object,  or  the  state  machine  object  in  full.  Other  references  to  the 
word object correspond to portions of state for a state machine implementation.

A.1.1 The Object-Based State Machine Object

An object-based state machine object consists of object state for a bounded number of objects
and the execute_request operation. Each object is referred to by its object identifier, denoted
oid, and the value of an object is denoted state[oid]. Each state[oid] is initialized to a default
value. The execute_request operation takes as input the current state and a request, applies
the request to the current state, and returns the new state and a response. In summary,
(response, state') ← execute_request(req, state).

An identifier for the client that submitted the request is often included in the request.
For notational simplicity, the pseudo-code will use a three-argument form of execute_request, which
differs from the two-argument form above in two ways. First, the client identifier cid is written
explicitly. Second, the pseudo-code form modifies state, so it need not return state' (i.e., state
is passed by reference). Thus, in the pseudo-code, the state machine is called as follows:
response ← execute_request(cid, req, state), which is equivalent to
(response, state) ← execute_request(req, state) if req includes cid.
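
The two calling conventions can be illustrated with a minimal Python sketch. The request format (a dict with hypothetical "op", "oid", and "value" fields) and the name execute_request_by_ref are assumptions made for illustration only; the dissertation does not prescribe them.

```python
from typing import Any, Dict, Tuple

State = Dict[str, Any]  # maps each oid to the value state[oid]


def execute_request(req: dict, state: State) -> Tuple[Any, State]:
    """Functional form: returns (response, state') and leaves state untouched."""
    new_state = dict(state)
    if req["op"] == "write":          # hypothetical request type
        new_state[req["oid"]] = req["value"]
        return "ok", new_state
    if req["op"] == "read":           # hypothetical request type
        return new_state[req["oid"]], new_state
    raise ValueError("unknown request")


def execute_request_by_ref(cid: str, req: dict, state: State) -> Any:
    """Pseudo-code form: cid is explicit and state is updated in place."""
    response, new_state = execute_request({**req, "cid": cid}, state)
    state.clear()
    state.update(new_state)
    return response
```

The by-reference form is only a wrapper: it folds cid into the request, calls the functional form, and writes the resulting state back into the caller's dictionary.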

Byzantine Locking requires a compact representation of the set of objects that a request may
touch, denoted oids_touched(req), and an efficient mechanism to check whether an object may
be touched by a request before executing the request, denoted oid ∈ oids_touched(req). The
application of requests is deterministic. Thus, the application of requests on disjoint sets of objects
commutes. That is, the relative order of requests that touch non-overlapping sets of objects does not
matter. For example, suppose

    (response_1, state_1) ← execute_request(req_1, state_0)
    (response_2, state_2) ← execute_request(req_2, state_1)

and

    (response_2', state_1') ← execute_request(req_2, state_0)
    (response_1', state_2') ← execute_request(req_1, state_1')

If req_1 and req_2 operate on disjoint sets of objects (that is,
oids_touched(req_1) ∩ oids_touched(req_2) = ∅), then the order of the requests does not
matter, and so response_1 = response_1', response_2 = response_2', and state_2 = state_2'.
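
The commutativity property can be checked on a toy state machine. The following Python sketch is an illustration under assumed names: the request fields ("name", "oids", "delta") and the update rule are hypothetical and chosen so that overlapping requests do not commute.

```python
def oids_touched(req):
    """Compact description of the objects a request may touch."""
    return frozenset(req["oids"])


def execute_request(req, state):
    """Toy deterministic request: each touched object becomes 2 * old + delta."""
    new_state = dict(state)
    for oid in req["oids"]:
        new_state[oid] = 2 * new_state[oid] + req["delta"]
    response = tuple(new_state[oid] for oid in req["oids"])
    return response, new_state


def run(reqs, state):
    """Apply reqs in order; return ({request name: response}, final state)."""
    responses = {}
    for req in reqs:
        responses[req["name"]], state = execute_request(req, state)
    return responses, state


state0 = {"a": 0, "b": 0, "c": 0}
req1 = {"name": "req1", "oids": ["a", "b"], "delta": 1}
req2 = {"name": "req2", "oids": ["c"], "delta": 10}
req3 = {"name": "req3", "oids": ["a"], "delta": 10}   # overlaps with req1

# Disjoint requests give the same responses and final state in either order.
disjoint_commutes = run([req1, req2], state0) == run([req2, req1], state0)
# Overlapping requests need not commute under this update rule.
overlap_commutes = run([req1, req3], state0) == run([req3, req1], state0)
```

Here req1 and req2 touch disjoint objects, so both orders agree, while req1 and req3 both touch object a and the two orders produce different responses and states.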

Note the difference between state machine interfaces that require objects to be specified in
advance (as in Zzyzx, HQ [32], and Q/U [2]) and interfaces that do not (as in Zyzzyva [55]
and PBFT [24]). That is, some state machines do not require an efficient and compact function
oids_touched(req) that can be used to specify which objects are touched in advance. In practice,
specifying a working set of objects in advance is easy in many important systems, such as file
systems. Furthermore, state machine implementations often already divide the state into objects to
allow efficient checkpointing and state transfer. But, interfaces that do not require advance
descriptions of objects touched can easily support some applications that are challenging to support in an
efficient manner using interfaces that require advance descriptions (e.g., consider dereferencing a
pointer within an object). Of course, oids_touched(...) can be conservative, perhaps even reporting
that all objects may be touched for each operation, so both interfaces can implement the same set of
applications, though an overly conservative implementation of oids_touched(...) may impact the
performance of Zzyzx, HQ [32], and Q/U [2].

A.1.2 The Log Object

A log object consists of object state, denoted by the array state[]; the request log and its position,
denoted request_log and reqno; and a nondecreasing value vs for synchronization with the manager
object. Initially, state[*] = NULL, request_log[*] = NULL, reqno = 0, and vs = 1. A log object
supports three operations: APPEND, IMPORT, and FETCH+UNLOCK. If state[oid] = NULL, then
oid is said to be unlocked; otherwise, oid is locked. Operations are associated with message tags for
remote procedure calls in the pseudo-code, and so each request is written out as (TAG, args). The
sequential specification of each operation is as follows:

(APPEND, cid, req) If each object touched by req is locked, then let reqno ← reqno + 1,
request_log[reqno] ← req, and perform execute_request(cid, req, state) and return its result.
Otherwise, return FAILURE.

(IMPORT, vs', oid, obj) If vs' = vs and state[oid] = NULL, then let state[oid] ← obj.

(FETCH+UNLOCK, oid) Let state[oid] ← NULL, let vs ← vs + 1, and return the triplet
(vs, oid, request_log).

In a typical execution, a client will IMPORT several objects and APPEND several requests. If
FAILURE is returned, the client retries the request at the manager object. If object oid is needed at
the manager object, a client can invoke FETCH+UNLOCK and replay request_log at the manager
object, which unlocks oid. In Zzyzx, each log object is associated with a single designated client
(denoted cid) that invokes operations APPEND and IMPORT, such that the client identifier passed
to APPEND is always cid.
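
The sequential specification of the log object can be sketched in Python as follows. This is a single-process illustration, not the Zzyzx implementation: the request format, the toy execute_request, and the choice to represent NULL by a missing dictionary key are all assumptions.

```python
FAILURE = "FAILURE"


def oids_touched(req):
    return {req["oid"]}                 # hypothetical single-object requests


def execute_request(cid, req, state):
    state[req["oid"]] += req["delta"]   # toy request: add delta in place
    return state[req["oid"]]


class LogObject:
    def __init__(self):
        self.state = {}            # oid -> value; a missing oid plays the role of NULL
        self.request_log = []      # request_log[reqno - 1] holds request number reqno
        self.reqno = 0
        self.vs = 1

    def append(self, cid, req):
        if any(oid not in self.state for oid in oids_touched(req)):
            return FAILURE         # some touched object is unlocked
        self.reqno += 1
        self.request_log.append(req)
        return execute_request(cid, req, self.state)

    def import_(self, vs, oid, obj):
        if vs == self.vs and oid not in self.state:
            self.state[oid] = obj

    def fetch_unlock(self, oid):
        self.state.pop(oid, None)  # state[oid] <- NULL
        self.vs += 1
        return (self.vs, oid, list(self.request_log))
```

An APPEND on an object that has not been imported fails, and an IMPORT that carries a stale vs is silently ignored, matching the specification above.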

A.1.3 The Manager Object

A manager object consists of object state, denoted by the array state[]; nondecreasing values vs_cid
for synchronizing with the log object for designated client cid; an array locktable[], such that
locktable[oid] = NULL if oid is unlocked or locktable[oid] = cid if oid is locked by the log object
for designated client cid; and reqno_cid describing the most recently replayed request for cid.
It supports three operations: EXEC, LOCK, and REPLAY+UNLOCK. Initially, state[oid] is set to
a default value for each oid, vs_cid = 1, locktable[*] = NULL, and reqno_cid = 0. A client identifier
cid uniquely identifies the designated client. For any designated client, there is at most one log
object. A client can be the designated client for multiple log objects by using multiple client
identifiers. Variable reqno_cid counts the requests replayed for client cid, not all executed requests. The
sequential specification of each operation is as follows:

(EXEC, cid, req) If each oid touched by req is unlocked (locktable[oid] = NULL), then perform
execute_request(cid, req, state) and return its result. Otherwise, return FAILURE.

(LOCK, cid, oid) If locktable[oid] = NULL, then locktable[oid] ← cid and return
(vs_cid, oid, state[oid]). Otherwise, return FAILURE.

(REPLAY+UNLOCK, cid, vs', oid, request_log) If vs' < vs_cid + 1, return FAILURE. Otherwise, do as
follows. First, let vs_cid ← vs'. Second, while request_log[reqno_cid + 1] ≠ NULL, let reqno_cid ←
reqno_cid + 1 and perform execute_request(cid, request_log[reqno_cid], state). Third, if oid ≠
NULL and locktable[oid] = cid, set locktable[oid] ← NULL. Fourth, return SUCCESS.

In a typical execution, a client invokes EXEC and may LOCK its working set of objects. To
allow access to an object locked by a log object, a client invokes FETCH+UNLOCK at the log
object before invoking REPLAY+UNLOCK to reconcile state between the log object and the manager
object, which unlocks the object.
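
The manager object's sequential specification can be sketched in the same style. As before, this is a hedged single-process illustration: the request format, the toy execute_request, the method name exec_, and the per-client dictionaries for vs_cid and reqno_cid are assumptions rather than the dissertation's implementation.

```python
FAILURE, SUCCESS = "FAILURE", "SUCCESS"


def oids_touched(req):
    return {req["oid"]}                 # hypothetical single-object requests


def execute_request(cid, req, state):
    state[req["oid"]] += req["delta"]   # toy request: add delta in place
    return state[req["oid"]]


class ManagerObject:
    def __init__(self, oids, default=0):
        self.state = {oid: default for oid in oids}
        self.locktable = {oid: None for oid in oids}   # None plays the role of NULL
        self.vs = {}       # cid -> vs_cid (initially 1)
        self.reqno = {}    # cid -> reqno_cid (initially 0)

    def exec_(self, cid, req):
        if any(self.locktable[oid] is not None for oid in oids_touched(req)):
            return FAILURE
        return execute_request(cid, req, self.state)

    def lock(self, cid, oid):
        if self.locktable[oid] is not None:
            return FAILURE
        self.locktable[oid] = cid
        return (self.vs.get(cid, 1), oid, self.state[oid])

    def replay_unlock(self, cid, vs, oid, request_log):
        if vs < self.vs.get(cid, 1) + 1:
            return FAILURE
        self.vs[cid] = vs
        n = self.reqno.get(cid, 0)
        while n < len(request_log):      # request_log[reqno_cid + 1] != NULL
            n += 1
            execute_request(cid, request_log[n - 1], self.state)
        self.reqno[cid] = n
        if oid is not None and self.locktable.get(oid) == cid:
            self.locktable[oid] = None
        return SUCCESS
```

Because reqno_cid persists across calls, a later REPLAY+UNLOCK that carries a longer request log replays only the requests that have not yet been replayed.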

A.2  Linearizability 

This  section  describes  how  a  manager  object  and  a  set  of  log  objects  can  be  combined  to  form  a 
linearizable  Byzantine  fault-tolerant  replicated  state  machine. 

A.2.1  The  Reads-from  Relation 

Definition A.2.1. A REPLAY+UNLOCK operation reads from a FETCH+UNLOCK operation
if the cid passed to REPLAY+UNLOCK is the designated cid for the log object at which
FETCH+UNLOCK was invoked and if FETCH+UNLOCK returns the (vs, oid, request_log) triplet that
matches the arguments to REPLAY+UNLOCK.

Definition A.2.2. An IMPORT operation reads from a LOCK operation if the cid passed to LOCK
is the designated cid for the log object at which IMPORT was invoked and the (vs, oid, obj) triplet
returned by LOCK matches the arguments to IMPORT.

Definition A.2.3. An operation history of a manager object and a set of log objects is reads-from
valid if each REPLAY+UNLOCK reads from a FETCH+UNLOCK operation and each IMPORT reads
from a LOCK operation.
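
The reads-from valid property can be phrased as a simple check over a recorded history. The record layout below (dicts with hypothetical keys op, cid, args, and ret, where cid names the designated client of the object involved) is an assumption for illustration; the check covers validity only and ignores the order of operations.

```python
def is_reads_from_valid(history):
    """history: list of dicts with keys op, cid, args, ret (all hypothetical)."""
    fetched = {(o["cid"], o["ret"]) for o in history if o["op"] == "FETCH+UNLOCK"}
    locked = {(o["cid"], o["ret"]) for o in history
              if o["op"] == "LOCK" and o["ret"] != "FAILURE"}
    for o in history:
        # Each REPLAY+UNLOCK must carry a triplet returned by some FETCH+UNLOCK.
        if o["op"] == "REPLAY+UNLOCK" and (o["cid"], o["args"]) not in fetched:
            return False
        # Each IMPORT must carry a triplet returned by some successful LOCK.
        if o["op"] == "IMPORT" and (o["cid"], o["args"]) not in locked:
            return False
    return True


log = ("req-a",)   # request_log kept as a tuple so that triplets are hashable
valid_history = [
    {"op": "LOCK", "cid": "c1", "args": ("x",), "ret": (1, "x", 0)},
    {"op": "IMPORT", "cid": "c1", "args": (1, "x", 0), "ret": None},
    {"op": "FETCH+UNLOCK", "cid": "c1", "args": ("x",), "ret": (2, "x", log)},
    {"op": "REPLAY+UNLOCK", "cid": "c1", "args": (2, "x", log), "ret": "SUCCESS"},
]
forged_history = valid_history[:2] + [
    {"op": "REPLAY+UNLOCK", "cid": "c1", "args": (2, "x", ("forged",)),
     "ret": "SUCCESS"},
]
```

The forged history replays a request log that no FETCH+UNLOCK ever returned, so it is not reads-from valid.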


A.2.2  The  Equivalence  of  Replayed  Requests 

In this section, consider a reads-from valid operation history. Consider the call to
execute_request(...) during a REPLAY+UNLOCK for some reqno_cid such that req =
request_log[reqno_cid]:

    retval ← execute_request(cid, req, state)    (A.2.4)

Consider the call to execute_request(...) during an APPEND at the log object with designated client
cid for some reqno' such that req' = request_log[reqno']:

    retval' ← execute_request(cid, req', state')    (A.2.5)

This section will show that the retval ← execute_request(cid, req, state) for reqno_cid in Equation
A.2.4 at the manager object during a REPLAY+UNLOCK corresponds to some retval' ←
execute_request(cid, req', state') for reqno' in Equation A.2.5 at the log object during an APPEND.
Lemma A.2.6 shows that req = req' and reqno_cid = reqno'. Lemma A.2.8 shows that retval = retval',
and that, after execute_request(...) at the manager object and log object, for each oid such that
state'[oid] ≠ NULL, it is the case that state[oid] = state'[oid] and locktable[oid] = cid.

Lemma A.2.6. For each call to execute_request(cid, req, state) in Equation A.2.4 for reqno_cid
by operation (REPLAY+UNLOCK, cid, *, *, *) at the manager object, there exists a call to
execute_request(cid, req', state') in Equation A.2.5 by operation (APPEND, cid, req') for reqno' at
the log object, such that req = req' and reqno' = reqno_cid.


Proof. Because the operation history is reads-from valid, the REPLAY+UNLOCK reads from a
FETCH+UNLOCK that returned request_log such that request_log[reqno_cid] = req. Because only
APPEND appends requests to the request log, an APPEND for reqno' must have appended req' to
request_log such that reqno' = reqno_cid and req' = req. □

Lemma A.2.7 will form the base case of the induction in Lemma A.2.8. In particular, Lemma A.2.7
shows that, after execute_request(...) at the manager object and log object, for each oid such that
state'[oid] ≠ NULL, it is the case that state[oid] = state'[oid] and locktable[oid] = cid, but only
when the last operation to modify oid at the log object was an IMPORT. Lemma A.2.8 removes the
restriction that the last operation to modify oid was an IMPORT.

Lemma A.2.7. Let state'[oid] be the value of oid at the log object before the call to
execute_request(...) by an APPEND in Equation A.2.5 for reqno'. Suppose state'[oid] ≠ NULL
and the last operation to modify oid at the log object was (IMPORT, vs, oid, obj). Then, it is the
case that state'[oid] = state[oid] and locktable[oid] = cid, where state[oid] and locktable[oid] are
the values at the manager object either at the end of the operation history if reqno' is not replayed, or
before the call to execute_request(...) in Equation A.2.4 for reqno_cid = reqno' if reqno' is replayed.


Proof. Because the operation history is reads-from valid, IMPORT reads from a LOCK that
sets locktable[oid] = cid and returns (vs, oid, obj) such that obj = state[oid]. IMPORT then sets
state'[oid] ← obj, which is not modified until the APPEND (the IMPORT was the last operation to
modify oid). The remainder of this proof shows that oid is not modified at the manager object after
the LOCK until reqno_cid is replayed, if ever, so state[oid] = obj and thus state'[oid] = state[oid].
There are two cases: it must be shown that, after the LOCK but before reqno_cid is replayed, no request
in an EXEC operation and no replayed request modifies oid.

No EXEC modifies oid after the LOCK until reqno_cid is replayed because oid remains locked in
that period, as follows. Only FETCH+UNLOCK can set state'[oid] = NULL. Because state'[oid] ≠
NULL before the APPEND, any such FETCH+UNLOCK either precedes the IMPORT, such that the
value vs' before FETCH+UNLOCK is less than vs (less than because FETCH+UNLOCK increments
vs'), or follows the APPEND. A REPLAY+UNLOCK for vs' must precede a LOCK for vs if vs' < vs,
and a REPLAY+UNLOCK that reads from a FETCH+UNLOCK that follows the APPEND for reqno' =
reqno_cid will replay reqno_cid if reqno_cid has yet to be replayed. Thus, after the LOCK but before any
replay of reqno_cid, it is the case that locktable[oid] = cid, and thus no EXEC modifies oid.

Similarly, no replayed request modifies oid after the LOCK until reqno_cid is replayed, because
the APPEND is the first APPEND to modify oid after the IMPORT. Any request replayed before
reqno_cid must precede not only APPEND but also IMPORT (otherwise, the last operation to modify
oid would not be the IMPORT). Furthermore, because state[oid] = NULL before the IMPORT, any
prior request that modified oid must have preceded a FETCH+UNLOCK' that incremented vs' < vs
and set state[oid] = NULL. Because state[oid] ≠ NULL when the prior request modifies oid, the
prior request follows an (IMPORT', vs", oid, obj), where vs" < vs' < vs. Because IMPORT' reads
from a LOCK' that returns vs" (by reads-from valid) and because vs" < vs, LOCK' precedes LOCK.
Because locktable[oid] = NULL when LOCK succeeds, a REPLAY+UNLOCK' that reads from a
FETCH+UNLOCK' that follows the prior request must have executed, replaying the prior request if
it has yet to be replayed. Thus any request that modifies oid before reqno_cid is replayed would be
replayed before the LOCK for vs. □

The  following  lemma  ties  the  state  of  the  log  object  together  with  that  of  the  manager  object 
before  and  after  the  execution  of  each  request. 

Lemma A.2.8. For each call to retval ← execute_request(cid, req, state) in Equation A.2.4 for
reqno_cid at the manager object, there exists a call to retval' ← execute_request(cid, req', state')
in Equation A.2.5 for reqno' at the log object such that reqno' = reqno_cid and retval = retval'.
Furthermore, for each oid such that state'[oid] ≠ NULL during the call at the log object, during the
call at the manager object it is the case that locktable[oid] = cid, and after both calls it is the case
that state'[oid] = state[oid].


Proof. That each execute_request(...) at the manager object during a REPLAY+UNLOCK is
matched by one at the log object is proven in Lemma A.2.6. The remainder of the lemma is proven
by induction over the value of reqno:

Inductive Hypothesis: Consider the call to execute_request(*, *, state') at the log object for reqno'
by an APPEND in Equation A.2.5. Consider the corresponding call to execute_request(*, *, state)
at the manager object for reqno_cid in Equation A.2.4. For each oid such that state'[oid] ≠ NULL
during the call at the log object, it is the case that locktable[oid] = cid during the call at the manager
object and that state'[oid] = state[oid] before the calls at both the log object and the manager object.

Corollary to Inductive Hypothesis: Suppose the inductive hypothesis holds. APPEND fails if any
modified object is unlocked, so, for each modified object, it is the case that state'[oid] ≠ NULL and
thus state'[oid] = state[oid]. Hence, because req' = req (by Lemma A.2.6) and by determinism, it
is the case that retval = retval' and that state'[oid] = state[oid] after the calls at the log object and
the manager object, as required by the lemma.

Base Case: Consider the first call to execute_request(cid, *, *) in a REPLAY+UNLOCK at the manager
object, which sets reqno_cid = 1, and the corresponding first call to execute_request(cid, *, *)
in the first successful APPEND at the log object, which sets reqno' = 1. For each oid such
that state'[oid] ≠ NULL, the last operation to modify oid must have been an IMPORT. Thus, by
Lemma A.2.7, it is the case that locktable[oid] = cid and state'[oid] = state[oid].

Inductive Step: Consider the call to execute_request(*, *, state') at the log object by an APPEND in
Equation A.2.5, which sets reqno', and the call to execute_request(*, *, state) at the manager object
in Equation A.2.4, which sets reqno_cid = reqno'. Consider each oid such that state'[oid] ≠ NULL
during the call at the log object. The last operation to modify oid at the log object was either an
IMPORT or an APPEND'. If the last such operation was an IMPORT, then, by Lemma A.2.7, it is the
case that locktable[oid] = cid and state'[oid] = state[oid].

If the last operation to modify oid at the log object was an APPEND' for some reqno", then
reqno" < reqno' = reqno_cid (because APPEND' precedes APPEND). Because the manager object
replays requests in sequence from 1 to reqno_cid, there is a corresponding execute_request(...) at
the manager object that replays reqno". By the inductive hypothesis at reqno" < reqno' = reqno_cid,
and its corollary, after the corresponding execute_request(...) replays reqno" at the manager object,
locktable[oid] = cid and state'[oid] = state[oid]. Because no other APPEND modifies oid between
reqno" and reqno', no request that modifies oid is replayed between reqno" and reqno'. Other than
replayed requests, only EXEC operations modify objects. Because locktable[oid] = cid, an EXEC
will not modify oid unless oid is unlocked. Thus, if locktable[oid] remains cid at the manager object,
then state'[oid] = state[oid] before the calls to execute_request(...) for reqno', and the lemma is
proven.

Suppose oid is unlocked by REPLAY+UNLOCK' at the manager object after reqno" is replayed
but before reqno' is replayed. Thus, REPLAY+UNLOCK' must have read from some
FETCH+UNLOCK' at the log object that unlocked oid and followed APPEND' but preceded APPEND.
This is a contradiction, because either oid will be unlocked for APPEND and APPEND will fail, or
an IMPORT will lock oid, in which case the previous operation was not APPEND'. □

A.2.3  Requests  that  are  not  Replayed 

A client will invoke APPEND operations at the log object, and each successful APPEND will be
appended to the request log in order at positions {1, ..., reqno}. The manager object replays some
of these requests in order, requests {1, ..., reqno_cid}. Lemma A.2.8 shows that for each reqno_cid
replayed by the manager object, there exists reqno' executed by the log object such that reqno' =
reqno_cid and the request and return values match. But, it may be the case that reqno_cid < reqno,
in which case one or more APPEND operations execute requests {reqno_cid + 1, ..., reqno} at the
log object that are never replayed by the manager object. This section shows that the return values
from such requests match the return values that would have been generated had the requests been
replayed at the manager object. This fact should come as no surprise, because, if the requests were
replayed in the future, their return values must match.

Lemma A.2.9. Consider any operation history of a manager object and a set of log objects that is
reads-from valid. Consider the sequence of requests in calls to execute_request(...) by the EXEC
and REPLAY+UNLOCK operations at the manager object, and consider the sequence of requests
in calls to execute_request(...) by the successful APPEND operations at each log object that are
not replayed at the manager object. Construct a request history by concatenating the sequence of
requests from the manager object followed by the sequence from each log object (in any order), and
placing each return value after its request. This request history forms a legal sequential history for
an object-based state machine object.

Proof. Section A.1.1 defines an object-based state machine object by the execute_request(...)
function and the current object state, such that a sequential history is legal if the sequence of
requests provided to execute_request(*, state) would produce the matching return values. Thus,
the sequence of requests and return values found in calls to execute_request(*, *, state) by EXEC
and REPLAY+UNLOCK operations at the manager object forms a legal sequential history for an
object-based state machine object.

If a request that is not replayed modifies oid at the log object with designated client cid, then
locktable[oid] = cid and state[oid] = state'[oid] at the manager object at the end of the operation
history, such that state'[oid] is the value at the log object before the first request that is not replayed
modifies oid, as follows. Consider the first request that is not replayed that modifies oid. If the
prior request to modify oid was an IMPORT, then, by Lemma A.2.7, it is the case that state'[oid] =
state[oid] and locktable[oid] = cid. If the prior request to modify oid was an APPEND, then, by
Lemma A.2.8, it is the case that state'[oid] = state[oid] and locktable[oid] = cid.

Because state[oid] = obj = state'[oid], and because the return value of execute_request(...)
depends only on the request and the current object state, any sequence of requests and return values
from each log object can be appended to the sequence of requests from the manager object while
remaining a legal history. Because locktable[oid] = cid, the sequences from each log object
operate on disjoint sets of objects, and thus can be appended in any order (see Section A.1.1) while
remaining a legal sequential history. □


A.2.4  Real-time  precedence:  Reads-from  Strict 

For any history, operation o is said to precede operation o' if the response to operation o precedes the
invocation  of  operation  o'  in  the  history.  Thus,  a  history  imposes  a  partial  ordering  on  its  operations 
according  to  this  precedence  relationship.  The  precedence  relationship  of  a  history  of  events  from 
the  execution  of  a  distributed  system  is  called  real-time  precedence. 

Remark A.2.10. The partial order imposed by a history can be represented as a directed acyclic
graph  (a  DAG  on  operations).  Consider  the  operation  history  of  a  manager  object  and  a  set  of  log 
objects.  Choose  a  linearization  of  each  log  object  and  the  manager.  The  linearization  provides 
additional  edges  to  the  DAG  such  that  the  operations  on  each  object  are  totally  ordered.  The  DAG 
remains  acyclic  because  linearization  respects  the  real-time  precedence  reflected  in  the  operation 
history. 

Definition A.2.11. A reads-from edge is an edge (o, o') from operation o to operation o' such
that  operation  o'  reads  from  operation  o. 

Definition A.2.12. An operation history is reads-from strict if, for each reads-from edge (o, o'),
operation  o  precedes  operation  o'  in  the  operation  history. 
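
The DAG view of a history can be made concrete with Python's standard graphlib module. The sketch below is an illustration under assumptions: the six operations and the per-object orders are a made-up example, and the helper names total_order and respects are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each pair (o, o') means that o must come before o'.
manager_order = [("LOCK", "EXEC"), ("EXEC", "REPLAY+UNLOCK")]
log_order = [("IMPORT", "APPEND"), ("APPEND", "FETCH+UNLOCK")]
reads_from = [("LOCK", "IMPORT"), ("FETCH+UNLOCK", "REPLAY+UNLOCK")]


def total_order(edges):
    """Return one total order consistent with every edge, or None on a cycle."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)   # add(node, *predecessors)
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


def respects(order, edges):
    """True if every edge (a, b) has a earlier than b in order."""
    position = {op: i for i, op in enumerate(order)}
    return all(position[a] < position[b] for a, b in edges)


dag = manager_order + log_order + reads_from
order = total_order(dag)
# If IMPORT instead preceded the LOCK it reads from, adding the reads-from
# edge would create a cycle, so no total order would exist.
bad = total_order(dag + [("IMPORT", "LOCK")])
```

When the history is reads-from strict, the reads-from edges agree with real-time precedence, the graph stays acyclic, and any topological order respects both the per-object linearizations and the reads-from edges.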

Remark A.2.13. If operation o precedes operation o', then there is an edge from o to o' in the
operation history. Thus, the DAG of a reads-from strict operation history includes each reads-from
edge. (If edge (o, o') is a reads-from edge, then edge (o, o') is in the DAG.)

A.2.5 The Linearizability of EXEC and APPEND

Theorem A.2.14. Consider any operation history of a manager object and a set of log objects that
is reads-from valid and reads-from strict. The request history consisting of the requests and return
values from each EXEC and successful APPEND operation in the operation history is linearizable.


Proof. Lemma A.2.9 proves that the request history consisting of the requests and return values
from each EXEC and REPLAY+UNLOCK operation concatenated with the requests and return
values by successful APPEND operations that are not replayed forms a legal sequential history.
Lemma A.2.8 proves that each request and return value in a REPLAY+UNLOCK operation is
matched by a request and return value in a successful APPEND operation. Thus, there is a request
history consisting of the requests and return values from each EXEC and APPEND operation in the
operation history that is a legal sequential history. The remainder of this proof, then, must prove
that there is such a history that respects real-time precedence.

Consider the legal sequential history from Lemma A.2.9, formed by concatenating EXEC and
REPLAY+UNLOCK operations and appending un-replayed APPEND operations. Replace each
REPLAY+UNLOCK by the corresponding sequence of APPEND operations that are replayed to form
a sequence of all of the successful EXEC and APPEND operations. Note that the subsequence of
EXEC operations at the manager object and the subsequence of APPEND operations for each log
object are in the order executed by the manager object and each log object. Furthermore, note that
each APPEND operation can be moved earlier, so long as it follows the LOCK of each object that it
touches and remains in the same order relative to other APPEND operations at that log object.

This order is precisely the order constraint imposed by the DAG in Remark A.2.10 if the operation
history is reads-from strict: the EXEC and APPEND operations are in the relative order
imposed by the linearization of the manager object and each log object, each APPEND follows the
IMPORT that follows the LOCK for each object that it modifies, and each APPEND precedes the
FETCH+UNLOCK that precedes the REPLAY+UNLOCK that replays the APPEND. Thus, a total
ordering of the DAG in Remark A.2.10 for a reads-from valid and reads-from strict operation history
produces a linearization of the request history consisting of the requests and return values from each
EXEC and APPEND operation. □

Thus, to build a linearizable replicated state machine from a manager object and a set of log
objects, each client should issue a request and return its result through an EXEC operation or a
successful APPEND operation. If an APPEND operation fails at a log object, the request can be
retried in an EXEC operation at the manager object. If an EXEC operation fails at the manager
object, the locked objects that are causing the failure can be unlocked and the EXEC operation can
be retried. The next section considers the implementation of the client so as to ensure liveness.
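
The retry rule in the preceding paragraph can be sketched as a small control loop. The function submit, its callback parameters, and the bounded number of attempts are assumptions made for this illustration; the stand-in callbacks below only simulate a lock held by another client's log object.

```python
FAILURE = "FAILURE"


def submit(req, try_append, try_exec, unlock_conflicts, max_attempts=3):
    """Return the result of req, preferring the log object over the manager."""
    result = try_append(req)
    if result != FAILURE:
        return result
    for _ in range(max_attempts):
        result = try_exec(req)
        if result != FAILURE:
            return result
        unlock_conflicts(req)      # FETCH+UNLOCK then REPLAY+UNLOCK, not shown
    return FAILURE


# Toy stand-ins: "x" starts locked by another client's log object.
locked = {"x"}
calls = []


def try_append(req):
    calls.append("APPEND")
    return FAILURE                 # this client does not hold the lock


def try_exec(req):
    calls.append("EXEC")
    return FAILURE if req["oid"] in locked else "done"


def unlock_conflicts(req):
    calls.append("UNLOCK")
    locked.discard(req["oid"])


result = submit({"oid": "x"}, try_append, try_exec, unlock_conflicts)
```

In this run the APPEND fails, the first EXEC fails because x is locked, the conflicting lock is released, and the retried EXEC succeeds.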


A.3  The  Client  and  The  Primary 

For any reads-from valid and reads-from strict operation history of a manager object and a set of
log objects, Section A.2 shows that the request history consisting of the requests and return values
found in the successful EXEC and APPEND operations is linearizable. This section describes how
a client would operate in such a system as to ensure liveness as well as the reads-from valid and
reads-from strict properties.

pid.primary(op):
2000: if (op = (EXEC, cid, req)) then
2001:     for (oid ∈ oids_touched(req) : locktable[oid] ≠ NULL) do
2002:         cid ← locktable[oid]
2003:         (vs', oid', request_log) ← logobj_cid((FETCH+UNLOCK, oid))
2004:         mgr_order((REPLAY+UNLOCK, cid, vs', oid', request_log))
2005:
2006: if (op ≠ (REPLAY+UNLOCK, *, *, *, *)) then
2007:     mgr_order(op)

Figure A.3.1: Primary pseudo-code. A correct primary unlocks objects that are touched by an
EXEC operation before the EXEC operation is ordered.

A.3.1  Reads-from  Valid  and  Reads-from  Strict 

Reads-from valid and reads-from strict properties can be ensured as follows. To invoke an Import or a Replay+Unlock, a client must first provide proof in the invocation of Import or Replay+Unlock that the Lock or Fetch+Unlock from which the Import or Replay+Unlock reads has completed. Thus, each Replay+Unlock reads from a Fetch+Unlock operation and each Import reads from a Lock operation, satisfying reads-from valid (Definition A.2.3). Furthermore, consider each reads-from edge from a Lock to an Import or from a Fetch+Unlock to a Replay+Unlock. Because the Lock or Fetch+Unlock completes before the Import or Replay+Unlock is invoked, the Lock or Fetch+Unlock precedes the Import or Replay+Unlock in the operation history, satisfying reads-from strict (Definition A.2.12).

In Zzyzx, the manager object that executes the Lock or the log object that executes the Fetch+Unlock provides cryptographic proof of its completion to the client. The client must provide this cryptographic proof before an Import or Replay+Unlock can be invoked that reads from the Lock or Fetch+Unlock. Even if the client is faulty, the existence of the proof of the response to the Lock or Fetch+Unlock in the invocation of Import or Replay+Unlock ensures that the response preceded the invocation.

A.3.2  The  Primary 

Some  replicated  state  machine  protocols  elect  a  temporary  designated  leader,  known  as  the  primary, 
from  the  set  of  servers,  known  as  replicas.  All  operations  are  then  ordered  through  the  primary, 
which  allows  efficient  operation  under  contention.  If  the  primary  orders  an  operation  correctly,  the 
protocol  ensures  that  the  operation  completes.  Because  a  faulty  primary  may  not  properly  order 
operations,  such  protocols  must  ensure  either  that  the  primary  orders  each  operation,  such  that  the 
operation  completes,  or  that  a  new  primary  is  elected. 

Such protocols elect a new primary based on timeouts, as follows. If a client does not get a response quickly enough, it broadcasts the operation to all replicas. The replicas forward the operation to the primary and start a timer. If the timer runs out before the operation is ordered, each replica votes to elect a new primary. After enough votes, a new primary is chosen (often in round-robin fashion among the replicas). For each new primary, the timeout value is increased (often by a small multiplicative factor). It must be the case that, given an infinite number of elections, a correct primary is elected an infinite number of times (e.g., by choosing among f + 1 or more replicas in a round-robin fashion).

LockedOids ← ∅   /* Initialize client’s list of locked objects */

cid.issue(req):   /* Issue a request */

2100: for (oid : /* oid meets lock criteria */) do
2101:   retval ← mgrobj((LOCK, cid, oid))
2102:   if (retval ≠ FAILURE) then
2103:     (vs, oid, obj) ← retval
2104:     logobj_cid((IMPORT, vs, oid, obj))
2105:     LockedOids ← LockedOids ∪ {oid}
2106: end for
2107: if (∀ oid ∈ oids_touched(req) : oid ∈ LockedOids) then
2108:   retval ← logobj_cid((APPEND, cid, req))
2109:   if (retval ≠ FAILURE) then
2110:     return retval
2111: end if
2112:
2113: /* Couldn’t use logobj, so use mgrobj */
2114: LockedOids ← LockedOids \ oids_touched(req)
2115: return mgrobj((EXEC, cid, req))

Figure A.3.2: Client pseudo-code. The client tries to lock its working set of objects. If it believes it has locked all objects that are touched, it invokes Append at the log object. If Append fails or a touched object is not locked by the client, it invokes Exec at the manager object.

Zzyzx uses the primary at the manager object to invoke Fetch+Unlock and Replay+Unlock operations. Because the primary orders all operations, it can ensure that an Exec operation never returns FAILURE, as shown in Figure A.3.1. When a client invokes an operation on the manager object (e.g., mgrobj(op = (Exec, req))), the protocol provides the operation to the primary (pid.primary(op) in Figure A.3.1), which must order the operation (mgr_order(op) on line 2007). Upon receiving an (Exec, cid, req) operation that touches one or more locked oids (line 2000), a correct primary unlocks each such oid as follows. For each oid that is touched and locked (line 2001), a correct primary invokes a Fetch+Unlock (line 2003) followed by a matching Replay+Unlock (line 2004) to unlock the object. Only the primary invokes Replay+Unlock (line 2006), so Replay+Unlock will succeed. The Fetch+Unlock is invoked at the log object of the designated client that locked the oid (line 2002). Once all touched objects are unlocked, the Exec operation is ordered (line 2007).
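
The control flow of Figure A.3.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: `fetch_unlock`, `mgr_order`, and the toy stand-ins below are hypothetical callables that replace the real log object and the ordering layer of the replicated state machine, and the variable `holder` is a name introduced here for the client found in `locktable`.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Op = tuple  # e.g. ("EXEC", cid, req) or ("REPLAY+UNLOCK", cid, vs, oid, request_log)


def primary_handle(
    op: Op,
    locktable: Dict[str, Optional[str]],
    oids_touched: Callable[[object], Set[str]],
    fetch_unlock: Callable[[str, str], Tuple[int, str, list]],
    mgr_order: Callable[[Op], None],
) -> None:
    """Unlock every locked object an EXEC touches, then order the EXEC."""
    if op[0] == "EXEC":                                    # line 2000
        req = op[2]
        for oid in sorted(oids_touched(req)):              # line 2001
            holder = locktable.get(oid)                    # line 2002
            if holder is None:
                continue                                   # not locked
            vs, oid2, request_log = fetch_unlock(holder, oid)            # line 2003
            mgr_order(("REPLAY+UNLOCK", holder, vs, oid2, request_log))  # line 2004
    if op[0] != "REPLAY+UNLOCK":                           # line 2006
        mgr_order(op)                                      # line 2007


# Toy stand-ins (hypothetical) for the log object and the ordering layer.
locktable: Dict[str, Optional[str]] = {"a": "c1", "b": None}
ordered: List[str] = []


def toy_fetch_unlock(cid: str, oid: str) -> Tuple[int, str, list]:
    return (1, oid, [])


def toy_mgr_order(op: Op) -> None:
    ordered.append(op[0])
    if op[0] == "REPLAY+UNLOCK":
        locktable[op[3]] = None  # replaying the log releases the lock


primary_handle(("EXEC", "c2", {"a", "b"}), locktable,
               lambda req: set(req), toy_fetch_unlock, toy_mgr_order)
print(ordered)  # ['REPLAY+UNLOCK', 'EXEC']
```

In the toy run, object "a" is locked by client "c1", so the primary orders one Replay+Unlock before the Exec; object "b" is not locked and needs no work.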

Thus, if the primary is correct, Exec never returns FAILURE. To ensure these semantics, replicas that implement the manager object reject Exec operations that touch locked objects (not shown), because such messages must be from a faulty primary. If not ordered correctly, the Exec operation will time out, and a new primary will be elected, until eventually the Exec operation is ordered properly such that it completes. Hence, either the current primary will order each operation correctly, or a new primary will be elected until the operation is ordered.

Figure A.3.1 describes the unlocking process as progressing sequentially (lines 2001-2004), but implementations should unlock multiple objects concurrently, both by invoking Fetch+Unlock concurrently for log objects with different designated clients and by modifying Fetch+Unlock to unlock multiple oids at once. Furthermore, there is no reason to block Replay+Unlock operations, Exec operations that touch only unlocked oids, or Lock operations that do not lock oids touched by a pending Exec.

A.3.3 The Client

The client implementation is shown in Figure A.3.2. The client tries to lock whatever objects it believes it may find useful (lines 2100-2106), perhaps based on recent usage. If locking succeeds at the manager object (line 2101), the client imports the object into its log object (line 2104) and updates its list of locked objects (line 2105). It is important to note that the client’s list of locked objects is its best guess as to which objects are locked, but the primary may unlock objects without notifying the client. If the client believes each object touched by the request is locked (line 2107), it invokes Append at the log object (line 2108). If Append succeeds, the client returns the response (line 2110). If the client has not locked an object that is touched by a request, or if Append fails, the client invokes Exec at the manager object (line 2115). Because the manager object must unlock all objects touched by the request, the client marks all touched objects as unlocked (line 2114).

In Figure A.3.2, the request history for Zzyzx clients consists of requests and return values from Exec operations (line 2115) and successful Append operations (line 2110), as required by Theorem A.2.14, which ensures that Zzyzx is linearizable. Also, locking objects is shown as an explicit action (line 2101), but it could just as well be implicit. If a client accesses an object many times in a row without other clients accessing the same object, the manager could lock the object for the client by default. Similarly, objects can be locked in batches rather than one at a time. And, of course, a more compact representation of the set of objects than a list could be used (e.g., “all files under the client’s home directory except those under subdirectory temp”).
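
The client logic of Figure A.3.2 can be sketched in Python as follows. The sketch assumes hypothetical `mgrobj` and `logobj` callables that stand in for invocations on the manager object and on the client's log object, and it passes the objects worth locking as an explicit `lock_candidates` argument, a choice the figure leaves open ("oid meets lock criteria").

```python
FAILURE = object()  # sentinel standing in for the FAILURE return value


class Client:
    def __init__(self, cid, mgrobj, logobj, oids_touched):
        self.cid = cid
        self.mgrobj = mgrobj              # hypothetical manager-object stub
        self.logobj = logobj              # hypothetical log-object stub
        self.oids_touched = oids_touched
        self.locked_oids = set()          # best guess only; primary may unlock

    def issue(self, req, lock_candidates=()):
        for oid in lock_candidates:                               # line 2100
            retval = self.mgrobj(("LOCK", self.cid, oid))         # line 2101
            if retval is not FAILURE:                             # line 2102
                vs, locked_oid, obj = retval                      # line 2103
                self.logobj(("IMPORT", vs, locked_oid, obj))      # line 2104
                self.locked_oids.add(locked_oid)                  # line 2105

        touched = set(self.oids_touched(req))
        if touched <= self.locked_oids:                           # line 2107
            retval = self.logobj(("APPEND", self.cid, req))       # line 2108
            if retval is not FAILURE:                             # line 2109
                return retval                                     # line 2110

        # Could not use the log object, so fall back to the manager object.
        self.locked_oids -= touched                               # line 2114
        return self.mgrobj(("EXEC", self.cid, req))               # line 2115


def demo():
    calls = []

    def mgrobj(op):
        calls.append(("mgr", op[0]))
        if op[0] == "LOCK":
            return (0, op[2], "value-of-" + op[2])
        return "exec-result"

    def logobj(op):
        calls.append(("log", op[0]))
        return "append-result" if op[0] == "APPEND" else None

    client = Client("c1", mgrobj, logobj, oids_touched=lambda req: req)
    first = client.issue(["x"], lock_candidates=["x"])   # fast path via Append
    second = client.issue(["x", "y"])                    # "y" unlocked: Exec
    return first, second, calls, client.locked_oids
```

In `demo`, the first request touches only the locked object "x" and completes through Append; the second touches an object the client has not locked, so it goes through Exec and the client forgets its lock on "x".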


A.3.4  Liveness 

Because a consensus protocol cannot ensure liveness in an asynchronous environment [40], liveness must depend on assumptions about synchrony. To ensure eventual progress, primary-based protocols typically assume that, after some period of time, message delays are bounded. If message delays are bounded, then the time a correct primary needs to order operations will be bounded, as well, because it is proportional to message delay. Suppose an operation does not complete. Then, a new primary will be elected and the timeout increased. Given infinitely many elections, a correct primary will be elected an infinite number of times, and the timeout will approach infinity. But, this is a contradiction, because the time needed by a correct primary to order operations is bounded and will thus eventually be less than the timeout, allowing the correct primary to order the operation. In short, liveness depends upon the timeout eventually increasing to a larger value than the time needed to order operations.

More formally, let Δ be a bound on message delay (accounting for needed retransmissions), let t represent elapsed time, let δ be the amount of time a correct primary needs to order an operation (δ is proportional to Δ), and let timeout be the current value of the timeout. A common synchrony assumption is partial synchrony [37], which states that, for possibly unknown values of Δ and τ, after time τ, message delay is bounded by Δ. If δ < timeout, a correct primary will be able to order operations such that they complete. If δ < timeout continues to hold when t > τ for some time τ, then any correct primary can order operations after time τ. So long as the timeout increases and a new primary is elected when an operation is not ordered, then, under partial synchrony, it is the case that each operation will be ordered correctly or that a correct primary will eventually be elected with timeout > δ and t > τ, such that the correct primary will order the operation. Either way, each operation will be ordered such that it completes, ensuring liveness.

Let ε be a bound on the response time for any invocation on the log object after time τ.¹ Suppose the primary for the manager object invokes Fetch+Unlock on log objects bounded by ε at most k times before ordering an operation (where k is bounded for a bounded number of objects). Then, if ε · k + δ < timeout, the primary will not time out and the operation will be ordered such that it completes. Because timeout increases when a new primary is elected, it will be the case that either operations are being ordered correctly, avoiding new elections, or that timeout will increase until ε · k + δ < timeout, such that the next correct primary can order operations, ensuring liveness. The following lemma states this fact more formally.

Lemma A.3.1. The manager object, modified such that the primary orders operations and invokes Fetch+Unlock and Replay+Unlock so as to prevent Exec from returning FAILURE, remains live under partial synchrony.


Proof. Because the primary is responsible for invoking Replay+Unlock, each client invokes only Exec and Lock operations. If an operation is a Lock operation or if it is an Exec operation that does not touch any locked objects, then a correct primary can order the operation in time bounded by δ. If the Exec operation touches one or more locked objects, a correct primary must unlock each locked object by invoking Fetch+Unlock. Each Fetch+Unlock will complete in time ε. Because there are a bounded number of objects, Fetch+Unlock is invoked a bounded number of times, less than some constant k. Thus, a correct primary will be able to order the operation in time bounded by ε · k + δ.

If an operation is not ordered, replicas will time out, forcing the election of a new primary and increasing the timeout. Because timeout increases upon each leader election, whereas Δ is fixed (though unknown) under partial synchrony, and because ε and δ are proportional to Δ, the sum ε · k + δ is bounded and will thus be less than timeout after enough elections. Given infinitely many elections, a correct primary is elected infinitely many times. Thus, either operations will be ordered by the current primary such that they complete, or after enough elections, a correct primary will be elected and ε · k + δ < timeout such that the correct primary can order each operation. □
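
The arithmetic behind the proof can be illustrated with a short Python sketch. It assumes the timeout grows by a fixed multiplicative factor at each election (the text only says "often by a small multiplicative factor"), and the function name and parameters are introduced here purely for illustration. It counts how many elections pass before the timeout exceeds the bound on ordering time; a correct primary must still be elected after that point.

```python
def elections_until_live(timeout, epsilon, k, delta, factor=2.0):
    """Count elections until epsilon * k + delta < timeout holds.

    timeout: initial timeout value (must be positive)
    epsilon: bound on the response time of one Fetch+Unlock
    k:       bound on the number of Fetch+Unlock invocations per operation
    delta:   time a correct primary needs to order one operation
    factor:  assumed multiplicative growth of the timeout per election
    """
    if timeout <= 0 or factor <= 1.0:
        raise ValueError("timeout must be positive and must actually grow")
    needed = epsilon * k + delta
    elections = 0
    while timeout <= needed:      # i.e. not (needed < timeout)
        timeout *= factor
        elections += 1
    return elections, timeout


print(elections_until_live(timeout=1.0, epsilon=0.5, k=4, delta=1.0))  # (2, 4.0)
```

Because the bound is fixed and the timeout grows geometrically, the loop always terminates, which mirrors the argument that the sum is eventually less than timeout.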

The  liveness  theorem  follows  immediately  from  the  preceding  lemma. 

Theorem A.3.2. The Zzyzx protocol is live, such that each request issued by a correct client is executed and returns a return value.


Proof. Consider the request flow in Figure A.3.2. The client calls logobj(...) (line 2104 and line 2108), which is implemented by some replicated state machine protocol that is live by assumption. The client also calls mgrobj(...) (line 2101 and line 2115), which is live by Lemma A.3.1. Both are called a finite number of times, ensuring that the Zzyzx protocol is live. □

¹ Note that completing an operation in PBFT or Zyzzyva when the primary is correct requires a small, bounded amount of computation and a few message delays, taking time proportional to Δ. A faulty primary will order the operation before timeout or a new primary will be elected at most a fixed number of times before a correct node is primary. Thus, the time required to complete an operation on a log object if implemented by such a protocol can be bounded by some ε.

A.3.5 Obstruction-Free Variants

Zzyzx is built on top of PBFT or Zyzzyva, so it is natural to ensure similar liveness semantics. But, it is worth mentioning that Byzantine Locking has a natural implementation in an obstruction-free [50] framework such as Q/U [2]. As in Zzyzx, each client issues Exec, Append, Lock, and possibly Import operations. But, instead of the primary handling Fetch+Unlock and Replay+Unlock, each client issues such commands to unlock objects. Thus, a client may find itself repeatedly unlocking an object in the face of interference from another client, but, as required under obstruction freedom, a client can complete any request when there is no such interference.


Appendix  B 

Zzyzx  Optimizations 


The log object operations are constructed so as to enable the optimized implementation in this section and in Sections 5.3 through 5.4. This chapter provides details to bridge the implementation from Chapter 5 with the formal model from Appendix A.

B.1 Faulty Client Isolation

Since  each  log  object  accepts  requests  from  a  single  designated  client,  Byzantine  locking  provides 
isolation  between  faulty  clients.  If  a  client  is  faulty,  the  Replay+Unlock  operation  need  only 
ensure  that  a  correct  client  could  have  issued  the  operations  found  in  the  request  log.  To  do  so, 
Replay+Unlock  stops  replaying  requests  upon  encountering  a  request  that  touches  an  object  that 
the  client  has  not  locked.  Isolating  faulty  clients  provides  significant  design  flexibility  with  which 
to  implement  the  log  object.  For  example,  if  the  client  is  faulty,  Fetch+Unlock  can  return  an 
arbitrary  request  log,  and  Append  and  Import  need  not  even  always  return  a  response. 
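
The replay rule described above (stop at the first request that touches an object the client has not locked) can be sketched in Python as follows. The function name and the representation of requests are assumptions made for illustration; `oids_touched` is a caller-supplied function mapping a request to the set of oids it touches.

```python
def replayable_prefix(request_log, locked_oids, oids_touched):
    """Return the prefix of request_log that a correct client could have issued.

    Replay stops at the first request touching an object outside locked_oids,
    so whatever a faulty client logged after that point is ignored.
    """
    prefix = []
    for req in request_log:
        if not set(oids_touched(req)) <= set(locked_oids):
            break
        prefix.append(req)
    return prefix


# Each toy request is simply the set of oids it touches.
log = [{"a"}, {"a", "b"}, {"c"}, {"a"}]
print(replayable_prefix(log, {"a", "b"}, lambda req: req))
```

With the client holding locks on "a" and "b", only the first two requests are replayed; the request touching "c" ends the replay, and the trailing request on "a" is dropped along with it.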

B.2  Separating  Unlock  from  Fetch 

This section splits the Fetch+Unlock operation into three operations, Unlock, Fetch, and Next_Vs, which enables the optimizations in Section B.3. To support Unlock, Fetch, and Next_Vs, the log object keeps two additional data structures: an unlocking list, which is a list of oids, and log, which is the request_log that Fetch returns for the current vs. If oid ∈ unlocking, Import cannot import oid, and any Append that touches oid returns FAILURE. Initially, unlocking = ∅ and log = NULL. The sequential specification of each operation is as follows:

(Unlock, oid)  Add oid to unlocking.

(Fetch, oid)  If log ≠ NULL, return log. If log = NULL and oid ∈ unlocking, let log ← (vs, oid, request_log) and return log. Otherwise, return FAILURE.

(Next_Vs, vs', oid)  If vs' = vs + 1, let vs ← vs + 1, state[oid] ← NULL, unlocking ← ∅, and log ← NULL.

lsid.logobj_cid((APPEND, vs', reqno', req)):

2300: if (vs' > vs) then get_view(vs')   /* Implicit NEXT_VS */
2301: if (reqno' = reqno) then return response
2302: if (reqno' > reqno + 1) then missed_reqs(reqno + 1)
2303: if (reqno' ≠ reqno + 1) then return
2304: if ((oids_touched(req) ∩ unlocking) ≠ ∅) then
2305:   return FAILURE
2306: else   /* Execute, append to log, and return */
2307:   return try_request(req)

lsid.logobj_cid((UNLOCK, vs', oid)):

2400: if (vs' > vs) then get_view(vs')   /* Implicit NEXT_VS */
2401:
2402: /* Process the UNLOCK operation */
2403: unlocking ← unlocking ∪ {oid}
2404:
2405: /* Implicit FETCH */
2406: mgrobj((FETCH, cid, vs, oid, request_log))

Figure B.3.1: Log server pseudo-code.


To unlock an object, the primary issues Unlock followed by Fetch at the log object, then Replay+Unlock at the manager object. The primary issues Next_Vs to clear log and the unlocking list before unlocking the next object.¹ Because Fetch does not increment vs, the manager object ensures during (Replay+Unlock, cid, vs', *, *) that vs' = vs_cid, and it sets vs_cid ← vs_cid + 1 (rather than ensuring that vs' > vs_cid and setting vs_cid ← vs', as in Section A.1.3).
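
The sequential specification of Unlock, Fetch, and Next_Vs given in this section can be written as a small Python class. This is a sketch of the specification only, not of the replicated implementation. It assumes that vs starts at 0 (the initial value is not stated in this section), and the class and attribute names are chosen here for illustration.

```python
FAILURE = "FAILURE"


class LogObjectSpec:
    """Sequential model of the Unlock / Fetch / Next_Vs operations."""

    def __init__(self):
        self.vs = 0               # assumed initial value
        self.state = {}           # oid -> imported object value
        self.request_log = []     # requests appended so far
        self.unlocking = set()    # initially the empty set
        self.log = None           # initially NULL

    def unlock(self, oid):
        self.unlocking.add(oid)

    def fetch(self, oid):
        if self.log is not None:
            return self.log                       # same answer for this vs
        if oid in self.unlocking:
            self.log = (self.vs, oid, tuple(self.request_log))
            return self.log
        return FAILURE

    def next_vs(self, vs_prime, oid):
        if vs_prime == self.vs + 1:
            self.vs += 1
            self.state[oid] = None
            self.unlocking = set()
            self.log = None


spec = LogObjectSpec()
before_unlock = spec.fetch("a")   # FAILURE: "a" is not being unlocked
spec.unlock("a")
fetched = spec.fetch("a")         # fixes log for the current vs
repeated = spec.fetch("b")        # log is already set, so it is returned again
spec.next_vs(1, "a")
after_next = spec.fetch("a")      # FAILURE again: unlocking was cleared
```

The repeated Fetch returning the same tuple reflects the requirement that Fetch return one value of log for a particular vs.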


B.3  Append,  Unlock,  and  Fetch 

In Zzyzx, a collection of 3f + 1 log servers implements a log object that tolerates f faults. The primary goal of the optimizations in this chapter is to enable Append operations to complete in a single round-trip (two one-way message delays) to 2f + 1 log servers, which improves upon prior results: Zyzzyva requires three one-way message delays to 3f + 1 replicas, and PBFT requires four one-way message delays to 2f + 1 replicas. The basic technique is as follows. A client will append its request to a request log at each of 2f + 1 log servers. To unlock an oid, the primary will tell each of 2f + 1 log servers to add oid to the unlocking list. To fetch, the primary will query 2f + 1 log servers for the most recent request log.

Because a successful Append will append req to request_log at 2f + 1 log servers, a subsequent Fetch at any 2f + 1 log servers will overlap on at least one correct log server, which will include req in its request_log. As described below, the manager object will choose the longest of 2f + 1 request logs, which will include req. Because Unlock for oid will complete at 2f + 1 log servers, a subsequent Append at 2f + 1 log servers that touches oid will overlap on at least one correct log server, which will return FAILURE.² If two or more log servers receive Unlock and Append messages in a different order, the primary will use the manager object to agree upon a canonical request log that log servers can replay to reach a consistent state.
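
The overlap claim follows from counting: two sets of 2f + 1 servers drawn from 3f + 1 share at least 2(2f + 1) − (3f + 1) = f + 1 servers, and at most f of those can be faulty. The brute-force Python check below confirms this for small f; the function name is introduced here for illustration.

```python
from itertools import combinations


def min_quorum_overlap(f):
    """Smallest intersection of any two (2f+1)-subsets of 3f+1 servers."""
    n, q = 3 * f + 1, 2 * f + 1
    quorums = [set(c) for c in combinations(range(n), q)]
    return min(len(a & b) for a in quorums for b in quorums)


for f in (1, 2):
    # The overlap is f + 1, so at least one common server is correct.
    print(f, min_quorum_overlap(f))
```

For f = 1 the minimum overlap is 2, and for f = 2 it is 3, matching f + 1 in both cases.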

Figure B.3.1 provides log server pseudo-code for the Append and Unlock operations. To invoke Append, a client sends a signed request to the log servers (the signature is denoted add in Figure B.3.1).³ Each correct log server verifies that the request is next in sequence (lines 2301-2303; see Section B.6.1 for an explanation of missed_reqs(...)). If the request is next in sequence, the log server verifies that each object touched by the request is locked (not shown) and not on the unlocking list (line 2304). If not, FAILURE is returned (line 2305). Otherwise, the signed request is appended to request_log and reqno is incremented (not shown). The request is then executed and the response is returned (line 2307). Upon receiving 2f + 1 successful responses, the client returns the majority response.

¹ Recall from Section A.3 that Zzyzx uses the primary to invoke Fetch+Unlock and Replay+Unlock. In other systems, clients could issue Unlock, Fetch, and Next_Vs directly, without impacting correctness.

² Note that an Unlock may precede an Append that precedes a Fetch, such that the Append returns FAILURE but req is found in request_log. This operation order would not be possible without logically separating Unlock from Fetch in Section B.2.

³ To handle requests that are dropped or corrupted by the network, the client periodically retransmits a request until it gets a response.

The primary invokes the Unlock operation by sending a request to all log servers, which add the oid being unlocked to the unlocking list (line 2403). Unlock completes when 2f + 1 log servers respond. The primary could separately fetch the request_log from 2f + 1 log servers, but, instead, Fetch is issued concurrently with Unlock, ensuring that oid ∈ unlocking. To ensure that subsequent invocations of Fetch return the same value log for a particular vs, each log server calls the manager object directly with its request_log (line 2406).⁴ The manager object uses its replicated state machine protocol to choose a canonical log_cid, as follows. Upon seeing 2f + 1 request logs from distinct log servers for cid, vs_cid, and matching oid, the manager object chooses the longest such request_log with correctly signed requests and sets log_cid ← (oid, request_log).⁵ Upon setting log_cid, the manager object can immediately execute (Replay+Unlock, cid, vs, oid, request_log).
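
The selection of the canonical log can be sketched in Python as follows. The sketch assumes the reports have already been filtered to a single cid, vs_cid, and oid, models distinct log servers as dictionary keys, and takes signature checking as a caller-supplied predicate `is_correctly_signed`; all of these names are hypothetical. Returning None when no log qualifies is also an assumption of the sketch.

```python
def choose_canonical_log(reports, f, is_correctly_signed):
    """Pick the longest correctly signed request log from 2f+1 distinct servers.

    reports: dict mapping a log server id to the request_log it reported
    Returns None until at least 2f+1 distinct servers have reported.
    """
    if len(reports) < 2 * f + 1:
        return None
    valid = [log for log in reports.values()
             if all(is_correctly_signed(req) for req in log)]
    if not valid:
        return None
    return max(valid, key=len)


reports = {
    "s1": ["r1", "r2"],
    "s2": ["r1"],
    "s3": ["r1", "r2", "FORGED"],   # contains a request that fails verification
}
chosen = choose_canonical_log(reports, f=1,
                              is_correctly_signed=lambda req: req != "FORGED")
print(chosen)  # ['r1', 'r2']
```

The log from "s3" is longer but contains a request that fails verification, so the longest fully signed log, from "s1", is chosen.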


B.4  Import 

Rather than performing Import explicitly, the Import operation can be implicit by allowing the log object to lazily fetch objects that it is missing from the manager object. Lazily fetching objects provides two main benefits. First, objects are copied directly from the manager to the log object without passing through the client. Avoiding copying through the client is of particular benefit if the servers that implement the manager object and log object are co-located. Second, lazily fetching objects allows a client to lock a large number of objects (e.g., the range of metadata objects that describe each file under a directory subtree), but only objects that are used are copied.

To enable implicit import, upon locking oid, the manager object stores vs_cid in addition to cid (locktable[oid] ← (cid, vs_cid)). The value of vs_cid represents the earliest vs at which an Import is allowed. Upon receiving an Append request that touches an object oid that has not been imported, a log server queries the manager object with (vs, oid) to see if the client has locked the object. If locktable[oid] = (cid, vs') for some vs' ≤ vs, the manager object can infer that oid was locked and could have been imported at vs'. Thus, the manager object returns the value of oid, and the log server imports oid. Because Next_Vs sets state[oid] ← NULL at the log server after Replay+Unlock sets locktable[oid] ← NULL at the manager object, at most one implicit Import matches each Lock. To ensure that vs'' ≤ vs for locktable[oid] = (cid, vs'') at the manager object and vs at the log server, upon locking an object, the manager returns vs_cid, which the client propagates as argument vs' to Append. If vs' > vs at the log server, the log server updates vs (line 2300), as described in Section B.6.

⁴ To avoid circular dependencies, the primary must order operations from log objects immediately, unlike operations from clients in Section A.3.2.

⁵ log_cid need not include vs_cid because the manager knows vs_cid.

B.5 Retrying if Append Returns FAILURE

As Append is constructed above, a log server may return FAILURE for a request that a different log server includes in its request_log. Thus, if any log server returns FAILURE, the client must determine whether req will be replayed at the manager object or not. The client does so using a new operation at the manager object: (Retry, cid, req, reqno, nonce). Retry requires the manager object to store a new variable, nonce_cid, which is initialized to 0. If req has already been replayed at the manager object, the manager object’s response to Retry tells the client to wait for successful Append responses at the log object. If req cannot be replayed at the manager object, the Retry operation acts like an (Exec, cid, req) operation. In order to prevent a Retry operation from executing an operation that will subsequently be replayed, the primary provides proof to the manager object that it has fetched all the most recent request_logs, and the manager object sets reqno_cid ← reqno. The nonce serves as an inexpensive proof that the most recent request_logs have been fetched. The rest of this section steps through an invocation of the Retry operation.

During an Append operation, if any log server returns FAILURE before 2f + 1 log servers return successfully, the client invokes mgrobj((Retry, cid, req, reqno, nonce)) at the manager object, where nonce is an unpredictable random nonce chosen by the client. The Retry invocation is authenticated like any other mgrobj(...) invocation. Before ordering the Retry operation, a correct primary ensures that either reqno_cid ≥ reqno or nonce_cid = nonce, as follows. If reqno_cid < reqno, the primary invokes (Unlock, vs_cid, NULL, nonce) at the log object, which is a special form of Unlock that does not add anything to unlocking but passes nonce through to Fetch. Log servers respond by invoking mgrobj((Fetch, cid, vs, NULL, request_log, nonce)). The manager object executes Replay+Unlock upon seeing 2f + 1 request logs from distinct log servers for cid, vs_cid, and matching oid and matching nonce, and it sets nonce_cid ← nonce. If reqno_cid < reqno even after Replay+Unlock, the primary unlocks objects touched by req, as it would before ordering (Exec, cid, req).

The primary then orders (Retry, cid, req, reqno, nonce). There are three cases for handling the Retry at the manager object. First, if reqno_cid < reqno and nonce_cid ≠ nonce, or if reqno_cid < reqno and nonce_cid = nonce but an oid that is touched by req is locked, the primary is faulty and a new primary will be elected. Second, if reqno_cid ≥ reqno, then the request was replayed at the manager object, so the manager object returns vs_cid. Upon receiving vs_cid, the client infers that the request completed at the log object. The client sets vs ← vs_cid and retries invoking (Append, vs, reqno, req) at the log object. Propagating vs_cid in Append ensures that each correct log server will issue Next_Vs (line 2300), which will ensure that reqno completes. Thus, the client can and does wait for f + 1 such correct log servers to provide matching responses.

Third, if reqno_cid < reqno and nonce_cid = nonce, the manager object sets reqno_cid ← reqno,
vs_cid ← vs_cid + 1, and log_cid.request_log[reqno'] ← NO-OP for reqno' ∈ {reqno_cid + 1, ..., reqno}.
Because nonce_cid = nonce, the primary has proven that it fetched the most recent request logs from
2f + 1 log servers after the (Retry, cid, *, reqno, nonce) operation was invoked. (Assuming the
primary could not guess the nonce value, the Unlock requests and Fetch responses that included
reqno_cid must have followed the Retry invocation.) An (Append, *, reqno', *) that succeeded at
2f + 1 log servers will overlap on at least one correct log server, so reqno' will be replayed. Thus,
the manager object can pad request_log with NO-OP requests, which do nothing other than advance
reqno, such that the log servers will advance to reqno upon the next Next_Vs. The manager object
then executes retval ← execute_request(cid, req, state), as it would when invoking (Exec, cid, req),
and returns (vs_cid, retval). The client sets vs ← vs_cid and returns retval to the application. The
manager object returns vs_cid such that the client can propagate vs_cid in its next Append request,
ensuring the log object will execute Next_Vs and advance to reqno.
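
The three Retry cases can be summarized in a short sketch. This is an illustrative Python sketch, not the pseudocode referenced by line number in this appendix. The names ManagerState, FaultyPrimary, and handle_retry, the touched and execute_request parameters, and the representation of locks as a set of oids are all assumptions made for illustration, as is treating reqno_cid = reqno as part of the second case.

```python
from dataclasses import dataclass, field

NO_OP = "NO-OP"


class FaultyPrimary(Exception):
    """Raised when handling the Retry shows the primary misbehaved (case 1)."""


@dataclass
class ManagerState:
    reqno: int = 0                                    # reqno_cid
    vs: int = 0                                       # vs_cid
    nonce: object = None                              # nonce_cid
    locked: set = field(default_factory=set)          # oids still locked by cid
    request_log: dict = field(default_factory=dict)   # log_cid.request_log


def handle_retry(m, req, reqno, nonce, touched, execute_request):
    """Handle an ordered (Retry, cid, req, reqno, nonce) at the manager object.

    touched: oids that req touches (hypothetical helper input).
    execute_request: callable(req) -> retval (hypothetical).
    Returns (vs_cid, retval); retval is None when the client should re-Append.
    """
    if m.reqno >= reqno:
        # Case 2: the request was already replayed; the client retries
        # Append with the returned vs_cid.
        return m.vs, None
    if m.nonce != nonce or any(oid in m.locked for oid in touched):
        # Case 1: the primary did not complete Replay+Unlock for this Retry.
        raise FaultyPrimary()
    # Case 3: pad the request log with NO-OPs up to reqno, then execute req.
    for r in range(m.reqno + 1, reqno + 1):
        m.request_log[r] = NO_OP
    m.reqno = reqno
    m.vs += 1
    return m.vs, execute_request(req)
```

In this sketch the election of a new primary is reduced to raising an exception; a real deployment would hand that decision to the underlying agreement protocol.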

B.6 State Transfer and Next_Vs

If a log server falls behind, such that its value of vs or reqno is less than the value vs' or reqno' sent
in Append and Unlock, then it transfers state from other servers. If reqno < reqno' during an
(Append, *, reqno', *), the log server fetches requests from other log servers (missed_reqs(reqno +
1) on line 2302). If vs < vs' during (Append, vs', *, *) or (Unlock, vs', *), then the log server
fetches a checkpoint of the state at vs' from other log servers (get_view(vs') on lines 2300 and 2400).

B.6.1 missed_reqs(. . . )

Upon receiving (Append, *, reqno', *), a log server may find that reqno' > reqno + 1 (line 2302)
because it missed an Append. If the client is correct, it completed Append for each of reqno +
1, ..., reqno' − 1, so f + 1 correct log servers appended request reqno + 1 to their request logs. In
missed_reqs(reqno + 1), the log server queries each log server for request reqno + 1. Upon finding
f + 1 matching responses, the log server replays reqno + 1.^6 The log server repeats this process
until reqno' = reqno + 1. Of course, in practice, log servers fetch multiple missed requests at once.

If the client is not correct, missed_reqs(reqno + 1) may not find f + 1 matching request log
entries for request reqno + 1, which will prevent the faulty client from making progress. It
will not, however, block the primary from completing Unlock operations or other clients from
completing Append operations on their log servers.
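
The matching rule in missed_reqs can be illustrated with a small sketch. It is a simplification written in Python under stated assumptions: the function names missed_req and missed_reqs_sketch, the fetch callable, and the representation of a response as a hashable value (None for no answer) are hypothetical and do not come from the pseudocode.

```python
from collections import Counter


def missed_req(responses, f):
    """Return the entry reported identically by at least f + 1 log servers.

    responses: one hashable request-log entry per queried log server, with
    None for servers that did not answer. Returns None if no entry has
    f + 1 matching responses.
    """
    counts = Counter(r for r in responses if r is not None)
    for entry, n in counts.most_common():
        if n >= f + 1:
            return entry
    return None


def missed_reqs_sketch(reqno, target, fetch, f):
    """Recover requests reqno + 1 .. target - 1, one at a time.

    fetch(n) is a hypothetical callable returning the responses of the other
    log servers for request n. Stops at the first request that lacks f + 1
    matching responses, which is the case of a faulty client.
    """
    replayed = []
    while target > reqno + 1:
        entry = missed_req(fetch(reqno + 1), f)
        if entry is None:
            break  # no progress for this client; other clients are unaffected
        replayed.append(entry)
        reqno += 1
    return reqno, replayed
```

The loop terminates once target = reqno + 1, mirroring the condition under which the log server can process the pending Append.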

B.6.2  Next_Vs  and  get_view(. . . ) 

This  section  describes  how  a  log  server  could  fetch  checkpoints  from  the  manager  object,  but  it  is 
merely  a  bridge  into  Section  B.6.3,  which  describes  how  to  transfer  state  from  other  log  servers. 

The Next_Vs operation is issued implicitly by propagating vs_cid as vs' in Unlock and
Append requests. Upon receiving (Append, vs', *, *) or (Unlock, vs', *), a log server may
find that vs' > vs (lines 2300 and 2400), in which case it calls get_view(vs') to request
(Next_Vs, vs', reqno', state') from the manager object. Checkpoint (vs', reqno', state') is
generated after Replay+Unlock by setting vs' ← vs_cid and reqno' ← reqno_cid, and processing each oid
as follows. If locktable[oid] = (cid, *), state'[oid] ← state[oid]. Otherwise, state'[oid] ← NULL.

Upon receiving a checkpoint (vs', reqno', state'), if vs > vs', the checkpoint is ignored.
Otherwise, the log server sets vs ← vs' and does the following. If reqno' ≤ reqno, then for each oid
such that state'[oid] = NULL, state[oid] ← NULL (that is, objects unlocked at the manager object
are unlocked at the log server). If reqno' > reqno, then reqno ← reqno' and state ← state'.
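
Checkpoint generation at the manager object and checkpoint application at a log server can be sketched as follows. This Python sketch rests on several assumptions that are not part of the protocol description: the lock table is simplified to a map from oid to the owning cid, NULL is represented by None, a log server is a plain dictionary, and the case reqno' = reqno is handled like reqno' < reqno. The function names are hypothetical.

```python
def make_checkpoint(vs_cid, reqno_cid, state, locktable, cid):
    """Manager side: keep objects still locked by cid, null out the rest."""
    ckpt_state = {
        oid: (value if locktable.get(oid) == cid else None)
        for oid, value in state.items()
    }
    return vs_cid, reqno_cid, ckpt_state


def apply_checkpoint(server, ckpt):
    """Log-server side. server has keys 'vs', 'reqno', and 'state'.

    Returns False if the checkpoint is stale and ignored, True otherwise.
    """
    vs2, reqno2, state2 = ckpt
    if server["vs"] > vs2:
        return False
    server["vs"] = vs2
    if reqno2 <= server["reqno"]:
        # The log server is at least as far along: only apply the unlocks.
        for oid, value in state2.items():
            if value is None:
                server["state"][oid] = None
    else:
        # The log server is behind: adopt the checkpoint wholesale.
        server["reqno"] = reqno2
        server["state"] = dict(state2)
    return True
```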

^6 During replay, an implicit import may fail if an object is unlocked at the manager object, increasing vs_cid. Thus, if
replay fails, the log server calls get_view(vs + 1).

B.6.3 Replaying Requests at the Log Object

Rather  than  replaying  the  request  log  at  the  manager  object,  Zzyzx  replays  the  request  log  at  the 
log  object,  which  is  often  inexpensive  because  log  servers  may  have  already  computed  the  result. 
This  section  describes  replaying  the  request  log  at  the  log  object  rather  than  at  the  manager  object. 
It  does  not  consider  re-using  results  that  the  log  servers  have  already  computed,  which  is  left  for 
Section  B.6.4. 

During Replay+Unlock, the manager object chooses the longest request_log and stores it
and the oid being unlocked in log_cid, as in Section B.3, but the manager object does not replay
request_log or unlock oid. Instead, upon choosing request_log, the primary sends each log server a
Replay request, which tells the log server to fetch (vs', oid, request_log') from the manager object.^7
The manager object returns vs_cid and the values of oid and request_log found in log_cid. (The manager
object may also be able to provide (vs', oid, request_log') directly to log servers in its response to
Fetch, as in Section 5.3.3.)

Upon receiving (vs', oid, request_log') from the manager object, a log server does as follows. If
vs > vs', the request is stale and is ignored. If vs' > vs, the log server fetches checkpoint_vs' from
f + 1 other log servers with matching checkpoints.^8 Otherwise, it is the case that vs = vs', so the log
server lets (state', reqno') ← checkpoint_vs and replays request_log', starting at reqno', on state'.^9
When replay completes, the log server lets vs ← vs + 1, obj ← state'[oid] and state'[oid] ← NULL,
and it creates a new checkpoint for the new value of vs, checkpoint_vs ← (state', reqno"), where
reqno" is the last replayed request. The log server sets state[oid] ← NULL, and, if reqno" > reqno,
it sets state ← state' and reqno ← reqno". Finally, the log server sends obj to the manager object,
which, upon receiving 2f + 1 matching values, sets locktable[oid] ← NULL, state[oid] ← obj, and
vs_cid ← vs_cid + 1.
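
The log server's handling of (vs', oid, request_log') can be sketched as below. The Python sketch is illustrative only and makes several assumptions: a log server is a dictionary with keys 'vs', 'reqno', 'state', and 'checkpoints' (a map from vs to a (state, reqno) pair), request logs are maps from reqno to request, replay covers entries with reqno greater than the checkpoint's reqno, and execute is a hypothetical callable that applies one request to a state. Failed implicit imports and the fetching of checkpoints from other log servers are omitted.

```python
def handle_replay(server, vs2, oid, request_log2, execute):
    """Process (vs', oid, request_log') at a log server.

    Returns (status, obj) where status is 'stale', 'fetch', or 'done'.
    obj is the unlocked object to send to the manager object when 'done'.
    """
    if server["vs"] > vs2:
        return "stale", None          # request is stale and ignored
    if vs2 > server["vs"]:
        return "fetch", None          # must first fetch checkpoint_vs'

    # vs = vs': replay request_log' on a copy of the checkpointed state.
    base_state, base_reqno = server["checkpoints"][server["vs"]]
    state2 = dict(base_state)         # shallow copy; assumes immutable values
    last = base_reqno
    for reqno in sorted(r for r in request_log2 if r > base_reqno):
        execute(state2, request_log2[reqno])
        last = reqno

    server["vs"] += 1
    obj = state2.get(oid)
    state2[oid] = None                # oid is unlocked in the new checkpoint
    server["checkpoints"][server["vs"]] = (dict(state2), last)
    server["state"][oid] = None
    if last > server["reqno"]:
        server["state"], server["reqno"] = state2, last
    return "done", obj
```

The manager object would then wait for 2f + 1 matching obj values before unlocking oid, a step that lies outside this sketch.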

B.6.4  Avoiding  Replay 

This section considers how to avoid executing requests twice at each log server, now that
Section B.6.3 has moved replay to the log object. Two request logs, request_log' and
request_log, are said to be consistent after request reqno' if, for reqno > reqno', it is
the case that request_log'[reqno] = request_log[reqno] when request_log'[reqno] ≠ NULL and
request_log[reqno] ≠ NULL. Section B.6.3 replays request_log' to compute checkpoint_vs and obj.
Suppose at some log server that the request log being replayed to reach the next vs, request_log',
is consistent with request_log after reqno', where request_log is the log server's request log and
reqno' = checkpoint_{vs−1}.reqno.
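
The consistency condition on two request logs can be written as a small predicate. This is a Python sketch under stated assumptions: request logs are dictionaries from reqno to request, a missing key or None stands for NULL, and the function name logs_consistent_after is hypothetical.

```python
def logs_consistent_after(log_a, log_b, reqno_prime):
    """True if the logs agree on every reqno > reqno_prime that both contain.

    Entries that are NULL (None or absent) in either log are not compared.
    """
    for reqno in set(log_a) | set(log_b):
        if reqno <= reqno_prime:
            continue
        a, b = log_a.get(reqno), log_b.get(reqno)
        if a is not None and b is not None and a != b:
            return False
    return True
```

For example, a log padded with NO-OPs by Retry is inconsistent with a log server's log that holds a real request at the same reqno.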

Let reqno" be the last request in request_log'. If reqno" > reqno, request_log' contains
requests missing in request_log, so the log server executes reqno + 1, ..., reqno" from request_log'
on state, lets vs ← vs + 1, obj ← state[oid], state[oid] ← NULL, and creates checkpoint_vs ←
(state, reqno"). If reqno" ≤ reqno, then the log server computes checkpoint_vs as follows. It
computes checkpoint_reqno ← (state, reqno) upon executing request reqno in request_log. Then, because
executing requests is deterministic, checkpoint_vs = checkpoint_reqno". Because executing requests
is deterministic, computing checkpoint_reqno for each request need not be expensive. A checkpoint
can be computed from three items: the previous checkpoint, objects that have since been imported,
and requests that have since been executed. As in Section 5.4.1, full checkpoints can be computed
at fixed intervals to bound replay. Furthermore, if state[oid] has not been touched since reqno", the
log server lets obj ← state[oid], obviating the need for replaying request_log'.

^7 Similarly, a newly elected primary sends log servers Replay requests for each log_cid that has not been replayed.

^8 Because 2f + 1 log servers will generate a new checkpoint for vs before the manager object advances vs_cid to vs, the
log server will always find f + 1 log servers with matching checkpoints. Also, note that standard techniques apply, such
as finding f + 1 log servers with matching hash trees of state before fetching state in chunks.

^9 During replay, an implicit import may fail, but only if 2f + 1 other log servers completed Replay, unlocking a
required oid at the manager object. Thus, if implicit import fails, the log server can abandon state' and stop replaying.

^10 request_log and request_log' may be inconsistent if the client is faulty or if Retry padded request_log' with NO-OPs.

B.6.5  Garbage  Collection 

As described, the request_log and checkpoints at the log server can grow to consume a
substantial amount of memory. This section describes how to reclaim this memory. If checkpoint_vs =
(*, reqno'), then for reqno < reqno' the log server can let request_log[reqno] ← NULL. If
request_log[reqno] is ever requested by missed_reqs(reqno) at another log server, this log server can
tell the other log server to fetch the checkpoint for vs. Similarly, upon computing a full checkpoint
checkpoint_vs, all prior checkpoints can be discarded. If another log server requests checkpoint_vs'
for vs' < vs, this log server can tell the other log server to fetch the checkpoint for vs.

If vs_cid is increasing at the manager object, a log server may fetch several checkpoints without
finding f + 1 matching values. Note that this problem will never block Unlock, because, if it
does, vs_cid will stop increasing at the manager object, such that the log server will eventually find
f + 1 matching checkpoints. A log server may find several non-matching checkpoints, however,
during an Append. If Append is blocked, vs_cid may increase, but reqno_cid will not increase. Thus,
checkpoints from correct log servers will be the same except that checkpoints for higher vs values
will have fewer locked objects. More formally, checkpoint_vs = (state, reqno) and checkpoint_vs' =
(state', reqno') are said to be consistent if reqno = reqno' and for each oid such that state[oid] ≠
NULL and state'[oid] ≠ NULL, it is the case that state[oid] = state'[oid].

Upon receiving f + 1 consistent checkpoints, a log server queries the manager object for vs_cid
and a list of oids locked by cid. If reqno_cid matches the value of reqno in each checkpoint, vs_cid
is greater than or equal to the value of vs for each checkpoint, and if, for each oid locked by cid,
the value of oid matches in each checkpoint, the log server constructs a checkpoint for vs ← vs_cid as
follows: checkpoint_vs ← (state', reqno'), where reqno' is the matching value in each checkpoint,
state'[oid] is the matching value for locked oids, and state'[oid] ← NULL for unlocked oids. A log
server will eventually find f + 1 such correct log servers and advance to vs, which will allow the
Append operation to continue.
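
The consistency test on checkpoints and the construction of a checkpoint for vs_cid can be sketched as follows. The Python below is a sketch under assumptions that are not in the protocol description: NULL is None, a checkpoint passed to checkpoints_consistent is a (state, reqno) pair, each checkpoint passed to merge_checkpoints carries its vs as (vs, state, reqno), and both function names are hypothetical.

```python
def checkpoints_consistent(c1, c2):
    """True if reqno matches and every oid non-NULL in both states agrees."""
    (state1, reqno1), (state2, reqno2) = c1, c2
    if reqno1 != reqno2:
        return False
    for oid in state1.keys() & state2.keys():
        v1, v2 = state1[oid], state2[oid]
        if v1 is not None and v2 is not None and v1 != v2:
            return False
    return True


def merge_checkpoints(ckpts, vs_cid, reqno_cid, locked_oids):
    """Build a checkpoint for vs_cid from consistent checkpoints.

    ckpts: list of (vs, state, reqno) from f + 1 log servers.
    vs_cid, reqno_cid, locked_oids: values reported by the manager object.
    Returns (vs_cid, state', reqno') or None if the conditions do not hold.
    """
    if not ckpts:
        return None
    if any(reqno != reqno_cid or vs > vs_cid for vs, _, reqno in ckpts):
        return None
    all_oids = set()
    for _, state, _ in ckpts:
        all_oids.update(state)
    merged = dict.fromkeys(all_oids)        # unlocked oids become NULL
    for oid in locked_oids:
        values = [state.get(oid) for _, state, _ in ckpts]
        if any(v != values[0] for v in values):
            return None                     # locked oid must match everywhere
        merged[oid] = values[0]
    return vs_cid, merged, reqno_cid
```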


B.7  Using  MACs  Instead  of  Signatures  in  Append 

The pseudocode in Figure B.3.1 requires that Append requests are signed, such that the manager
object can authenticate request logs when choosing the longest request_log to store in log_cid. This
section describes how to avoid signatures. Suppose the client replaced the signature with an
authenticator, which is a vector of MACs, one for each log server. The authenticator will allow each log
server to verify the authenticity of an Append request, but it cannot also be verified by the manager
object. If the longest request_log is proposed by f + 1 log servers, the manager can trust that it was
authenticated (at least one correct log server proposed request_log and authenticated each request),
but the manager must also be able to choose a request log when fewer than f + 1 match. To do so,
log servers implement the following voting scheme.

To invoke Append, the client generates a layered authenticator, which is a second vector of
MACs computed over the authenticator. The outer vector of MACs is the outer authenticator, while
the inner vector is the inner authenticator. Upon receiving an Append request, a correct log server
authenticates the request and the inner authenticator using the outer authenticator. The outer
authenticator is discarded, but the inner authenticator is appended to request_log in place of the
signature. Upon receiving Unlock, each log server sends its authenticated request_log to the primary,^11
which accumulates a set of 2f + 1 request_logs that it sends to each log server. Upon receiving
authenticated request_logs from 2f + 1 distinct log servers, each log server tests whether the inner
MACs properly authenticate each request in each request_log. Each log server votes on the request_logs
by invoking Fetch at the manager object, voting in favor of each request_log where all requests
are properly authenticated, and voting against each request_log where any request is not properly
authenticated. The manager object chooses the longest request_log with f + 1 votes in favor, or the
empty log if no request_log achieves f + 1 votes.
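
The layered authenticator and the voting rule can be made concrete with a short sketch. It is written in Python with the standard hmac and hashlib modules and rests on assumptions not stated in the text: HMAC-SHA256 as the MAC, one pairwise key per log server, the outer MACs computed over the request concatenated with the inner vector, and a request_log represented as a list of (request, inner authenticator) pairs. All function names are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()


def layered_authenticator(keys, request: bytes):
    """Client side: return (inner, outer) vectors, one MAC per log server."""
    inner = [mac(k, request) for k in keys]
    covered = request + b"".join(inner)     # outer MACs also cover inner
    outer = [mac(k, covered) for k in keys]
    return inner, outer


def check_append(i, key, request, inner, outer):
    """Log server i authenticates the request and inner vector via outer[i]."""
    expected = mac(key, request + b"".join(inner))
    return hmac.compare_digest(outer[i], expected)


def vote(i, key, request_log):
    """Log server i votes in favor only if its inner MAC verifies for all entries."""
    return all(
        hmac.compare_digest(inner[i], mac(key, request))
        for request, inner in request_log
    )


def choose_log(logs, votes_in_favor, f):
    """Manager object: longest log with at least f + 1 favorable votes, else empty."""
    eligible = [log for log, n in zip(logs, votes_in_favor) if n >= f + 1]
    return max(eligible, key=len, default=[])
```

Because each log server can check only its own entry of the inner vector, a single vote proves little; the f + 1 threshold ensures that at least one correct log server verified every request in the chosen log.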

Further optimization: The outer authenticator is used by the log server to ensure that a request
is authentic before executing it and appending it to the request log, but the isolation property
discussed in Section B.1 enables the client to avoid generating the outer authenticator in mostly benign
environments. If the client does not generate an outer authenticator, the log server can include the
entire inner authenticator in its authenticated response to the client. If the inner authenticator is
corrupted during invocation, the MAC in the response will not verify, so the client will generate the
outer authenticator and try again. Upon receiving different requests for the same reqno, the log
server verifies the MAC in the outer authenticator, possibly undoing operation reqno. In its next
request, the client includes the prior inner authenticator or its hash, such that the log server can
authenticate the prior inner authenticator.

Upon receiving authenticated responses from 2f + 1 log servers, a correct client can infer that
at least one correct log server that has appended req and the inner authenticator to its request log will
participate in the next Fetch. Thus, if each reqno is voted on independently (rather than as part of a
request_log), at least f + 1 correct log servers will vote in favor of the request, so the request will be
reflected in the next Replay+Unlock. Similarly, if a correct log server calls missed_reqs(reqno),
then f + 1 correct log servers will return req.

Even further optimization: The log server need not authenticate the prior inner authenticator
until Fetch or missed_reqs(. . . ). Thus, if the client sends a cumulative MAC as the inner
authenticator of the entire request_log (or its hash), only the cumulative MAC need be verified. As a
result, the lower bound on bottleneck MACs per request for the replicated state machine in Figure 5.1.2
is 1, just a single MAC. This technique is primarily of theoretical interest. To prevent a long sequence
of speculative requests (and potential undos), the log server may wish to verify the inner MAC before
executing a request, or at least every few requests. The implementation evaluated in Section 5.5 verifies
the inner MAC for each request.


^11 request_log could be signed, or log servers could generate authenticators, using signatures on failure.
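
One way to realize the cumulative MAC over the hash of the entire request log is a running hash chain. The Python sketch below is only an illustration of that idea and is not the evaluated implementation: the choice of SHA-256 and HMAC-SHA256, the all-zero digest for the empty log, and the names extend, log_digest, cumulative_mac, and verify_log are assumptions.

```python
import hashlib
import hmac

EMPTY = b"\x00" * 32  # assumed digest of the empty request log


def extend(digest: bytes, request: bytes) -> bytes:
    """Fold one request into the running hash of the request log."""
    return hashlib.sha256(digest + hashlib.sha256(request).digest()).digest()


def log_digest(requests) -> bytes:
    """Hash of the whole request log, computed incrementally."""
    digest = EMPTY
    for request in requests:
        digest = extend(digest, request)
    return digest


def cumulative_mac(key: bytes, requests) -> bytes:
    """Client side: a single MAC covering every request appended so far."""
    return hmac.new(key, log_digest(requests), hashlib.sha256).digest()


def verify_log(key: bytes, requests, tag: bytes) -> bool:
    """Log-server side: one MAC verification authenticates the whole log."""
    return hmac.compare_digest(tag, cumulative_mac(key, requests))
```

Since extend only needs the previous digest, a log server can keep a running digest per client and defer the single MAC verification until Fetch or missed_reqs.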